Super paramagnetic clustering of protein sequences
© Tetko et al; licensee BioMed Central Ltd. 2005
Received: 29 September 2004
Accepted: 01 April 2005
Published: 01 April 2005
Detection of sequence homologues represents a challenging task that is important for the discovery of protein families and the reliable application of automatic annotation methods. The presence of domains in protein families of diverse function, inhomogeneity and different sizes of protein families create considerable difficulties for the application of published clustering methods.
Our work analyses the Super Paramagnetic Clustering (SPC) and its extension, global SPC (gSPC) algorithm. These algorithms cluster input data based on a method that is analogous to the treatment of an inhomogeneous ferromagnet in physics. For the SwissProt and SCOP databases we show that the gSPC improves the specificity and sensitivity of clustering over the original SPC and Markov Cluster algorithm (TRIBE-MCL) up to 30%. The three algorithms provided similar results for the MIPS FunCat 1.3 annotation of four bacterial genomes, Bacillus subtilis, Helicobacter pylori, Listeria innocua and Listeria monocytogenes. However, the gSPC covered about 12% more sequences compared to the other methods. The SPC algorithm was programmed in house using C++ and it is available at http://mips.gsf.de/proj/spc. The FunCat annotation is available at http://mips.gsf.de.
The gSPC calculated to a higher accuracy or covered a larger number of sequences than the TRIBE-MCL algorithm. Thus it is a useful approach for automatic detection of protein families and unsupervised annotation of full genomes.
Numerous genome-sequencing projects have caused a rapid growth of the protein databases. In contrast to the pre-genomic era, when the selection of sequences was highly biased towards known and characterized genes, the systematic exploration of genomes now allows to assign more and precise functional properties in the majority of cases. However, manual annotation of sequences is laborious and expensive. Thus, there is a strong interest in developing reliable methods for the automatic functional classification of genome sequences employing evolutionary sequences as reflected in using sequence homology to predict functional properties. The identification of protein families, defined as set of proteins with significant sequence similarity encoding for at least related but often identical function between members, is a very important subtask to achieve this fundamental goal. Indeed, the fact that proteins with high sequence similarity share a common evolutionary history is accepted as the basis for functional assignment .
Among the different methods proposed to organize the sequence space into protein families, several approaches based on clustering using sequence similarity scores were successfully established (see e.g., [2–4]). However, the multi-domain composition of proteins, as well as the presence of promiscuous domains can influence the accuracy of such methods. Recently, an efficient algorithm for large-scale detection of protein families based on the Markov cluster algorithm, TRIBE-MCL, was proposed . This algorithm simulates random walks within a graph by iterative alternation of two operators called expansion (explores intra-cluster structure) and inflation (eliminates flow between different clusters). In comparison to other clustering algorithms, the TRIBE-MCL produces clusters that resist contamination by promiscuous domains that could provide significant problems for other clustering algorithms as is discussed elsewhere . TRIBE-MCL was tested using large databases of manually annotated protein sequences such as SwissProt  and SCOP  and has already been widely used in bioinformatics (about 50 publications referred to this algorithm since its publication in 2002 according to ). Thus this method is one of the recognized bioinformatic tools and its results can be used as an established standard for comparison of new algorithms. Moreover, we have already used the TRIBE-MCL algorithm for the analysis of the SIMAP database .
Several clustering methods have appeared in recent years. One of these, the Super Paramagnetic Clustering (SPC) has received considerable attention in microarray data analysis [11, 12]. This algorithm provides clustering of input data  based on analogy to the physics of an inhomogeneous ferromagnet. The method detects natural (physical) clusters present in the data and is able to efficiently cluster difficult test examples, such as concentric circles. The SPC algorithm was also successfully used in a supervised setting to analyze protein sequences and classify SCOP and CATH proteins according to their FSSP scores . Following our first successful application of SPC to a database of RING-finger domains  and our approach to project expression data to known functional modules , the present study further investigates the power of SPC to cluster protein sequences of two large databases, SwissProt and SCOP. We compare its performance with the TRIBE-MCL algorithm. Since both these databases do not contain complete genome sequences required for an unbiased comparison of the methods, we additionally analyzed protein sequences from four bacterial genomes, namely Bacillus subtilis, Helicobacter pylori, Listeria innocua and Listeria monocytogenes manually annotated at MIPS according to FunCat [17–19]. We also introduce an extension of this algorithm, global SPC or gSPC, which performs step-wise clustering on different levels of connectivity between points and provides significantly improved performance to the annotation of whole genomes compared to both the original SPC algorithm and TRIBE-MCL.
Clustering of a simulated data
Figure 1 demonstrates that SPC (K = 10) was able to correctly determine the presence of three clusters in the data. Two splits of clusters at temperature T = 0.054 and T = 0.084 are observed. The first split corresponds to a separation of clusters formed by the 1 st and two other distributions. The second split corresponds to a separation of clusters formed by 2 nd and 3 rd distributions. Following these two breaks, the cluster melts on singletons. Thus, the hierarchical structure of data was uncovered and physical clusters present in these data were found. The noise between the data points from distributions 1 and 2 (green circles) did not affect the clustering results.
In contrast to SPC, the TRIBE-MCL algorithm has some difficulties in correctly determine the structure of the data. For example, for inflation parameter 2.1, the algorithm subclusters points generated by normal distributions 1 into 2 different subclusters. For the inflation value of 5, one can already observe 5 different clusters. One of the largest clusters contains 21 points, including 6 and 15 points generated by distribution 2 and 3, respectively. This cluster remains stable even for inflation parameter 20. Thus TRIBE-MCL could not detect the physical structure of this data set. Of course, one should not draw a general conclusion about the relative performance of both algorithms following only a single simulated example.
Comparison of algorithms using SwissProt and SCOP databases
The performance of the algorithms was investigated using SwissProt  and SCOP [3, 8] databases. The accuracy of SPC clustering for the SwissProt database was assessed by analysis of InterPro domains  and Swiss keywords of members in calculated clusters. Sequences without any annotation were used for data clustering but were not considered to estimate the performance of the methods. Ideally, all members of each detected cluster should have exactly the same annotation in terms of InterPro domains and Swiss keywords.
Analogous to the previous analysis  only clusters that contained at least 4 or more annotated sequences were considered. The domain (or keyword) combination detected for ≥50% annotated sequences in the cluster was used as the (consensus) annotation of the cluster. Since some proteins had more than one domain (or keyword), we measured the performance of the method by counting the number of correctly assigned domains rather than the number of correctly classified proteins. This procedure avoids ambiguities in cases where, for instance, the annotation of three out of five domains was predicted correctly. The number of true positive (TP) domains/keywords was determined as the count of domains/keywords that coincided with the cluster annotation. The number of false negatives (FN), i.e. domains/keywords observed for a particular protein but absent in the cluster annotation, and false positive (FP), i.e. domains/keywords presented in the cluster annotation but missed for some particular proteins, were calculated. These numbers were used to compute the sensitivity = TP/(TP + FN) and specificity = TP/(TP + FP) of the clustering algorithms analyzed. The sensitivity is equivalent to the probability of correctly predicting some classifier while specificity is defined as the probability that the provided prediction is correct .
Not all proteins initially used in the evaluation will get a chance to be annotated by clustering. Some proteins will not be clustered at all, because they do not have significant hits to other proteins. These proteins can be either treated as false negatives (indeed, their categories were not predicted) or simply excluded from the analysis (since they cannot be clustered and the clustering algorithm explicitly "refused" to annotate them). The sensitivity that is calculated taking clustered proteins appears to be more relevant. Indeed, if a protein was clustered, the sensitivity determined in our study will indicate how many of the existing categories of the protein analyzed are expected to be correctly predicted. This definition deals with a posteriori sensitivity, i.e. it should be used only after clustering of the protein families. The sensitivity determined by considering all non-clustered proteins as false positive, corresponds to a priori sensitivity. Indeed, this number indicates how many categories of the given protein will be correctly predicted when there is no knowledge if a protein will be clustered or not. Since each of these two definitions of sensitivity has its own advantage (e.g. the later allows for a more straightforward comparison of methods) we calculated them both. Notice, that this definition of specificity does not include FN and thus it is not affected by which definition is used.
SCOP database analysis
Sequences from the PDB database  (Release from 01/07/2003) were clustered after removing redundant entries. These sequences were annotated using the SCOP database v. 1.63 [3, 8]. The TRIBE-MCL results were calculated using the inflation value of 5 . The total number of proteins used for analysis was 15,605 and 12,961 of the sequences had assigned SCOP domains. The total number of manually curated domains was 13,070 domains. Both annotated and non-annotated proteins were used for clustering. Obviously the method performance was calculated only for the annotated cases.
Clustering of PDB sequences using SPC, gSPC and TRIBE-MCL algorithms
a posteriori sensitivity
a priori sensitivity
Thus, the performance of the clustering method using all connections did not dramatically decrease due to the addition of the new sequences, but rather due to worse prediction of some sequences that were already clustered using K = 20. Therefore by joining results calculated using variable K values and preserving results calculated for the sequences that were clustered in each preceding step, one can expect to increase both sensitivity and specificity of the method. Indeed, the use of the gSPC method provided a considerable increase in clustering performance. The number of false positive and false negative for the "all NN" clusters was lower than the numbers calculated using fixed value of K = 20, but as many as 610 new sequences were covered. The gSPC method outperformed the TRIBE-MCL and SPC in terms of both a priori and a posteriori sensitivities. Thus the performance of the gSPC was considerably better than the other methods both in terms of covered sequences and quality of annotation. This improvement is of great importance for an automatic annotation of protein sequences.
Analysis of InterPro domain composition in clusters calculated for SwissProt database
a posteriori sensitivity
a priori sensitivity
Analysis of SwissProt keywords composition in clusters calculated for the SwissProt database
a posteriori sensitivity
a priori sensitivity
Comparison of algorithms for annotation of bacterial genomes
For this analysis 11,502 protein sequences from four completely sequenced genomes, Bacillus subtilis, Helicobacter pylori, Listeria innocua and Listeria monocytogenes, were used. The annotation of the genomes was done at MIPS using FunCat 1.3 . The FunCat is an annotation scheme for the functional description of proteins from prokaryotes, unicellular eukaryotes, plants and animals [17, 18]. Taking into account the broad and highly diverse spectrum of known protein functions, FunCat consists of 28 main functional categories (or branches) that cover general fields like cellular transport, metabolism and signal transduction. The main branches exhibit a hierarchical, tree like structure with up to six levels of increasing specificity. In total, the FunCat 1.3 included 1,445 functional categories and a total number of 403 distinct categories were available for analyzed bacterial genomes. The manual functional classifications were presented for 6,354 proteins.
An estimation of performance for proteins that have similar but not exact annotation represent some challenge. Let us consider an example of a cluster containing three proteins. The first protein that has annotation
01.01.01.01.02, biosynthesis of the glutamate group
01.05.01, C-compound and carbohydrate utilization
the second protein has annotation
01.01.01.01.02.01, biosynthesis of proline
and the third protein is annotated with one category only:
01.01.01.01.02, biosynthesis of the glutamate group
The annotation of all three proteins is similar but the annotation of the second protein is more detailed for the metabolism (01) category while the first protein contains additional category 01.05.01. The third protein does not have an annotation category 40, subcellular localization.
We measured the performance of annotation by counting the number of all non-redundant subcategories, i.e. 01, 01.01, 01.01.01, etc. In the above example the consensus annotation of this cluster is 01.01.01.01.02 and 40.03 categories (including all their sub-categories). The number of TP annotations is 19 = 7 + 7 + 5. The number of FN is 3. These are sub-categories 01.05, 01.05.01, and 01.01.01.01.02.01, for the first and second proteins, respectively. There are also two FP annotations, 40 and 40.03 observed for the third protein.
Clustering of sequences of bacterial genomes using SPC, gSPC and TRIBE-MCL algorithms
a posteriori sensitivity
a priori sensitivity
Comparison of different clustering algorithms
Analyzed data set
SwissProt InterPro domains
Bacterial genomes, FunCat 1.3
We have described and demonstrated the use and performance of SPC and of its extension, gSPC, for the clustering of protein sequences using sequence similarity only. For the first time, the SPC algorithms for clustering of protein sequences was employed in a large-scale study. Our results confirm that this method is a valuable, reliable tool for the automatic functional classification of protein sequences.
The use of the step-wise clustering approach, gSPC, improved sensitivity and specificity of the original method and allowed us to get a higher accuracy compared to the TRIBE-MCL and original SPC algorithms using data sets from PDB and SwissProt databases. The performance of the gSPC for annotation of Swiss Keywords favorably compares with supervised approaches. For example, the percentage of sequences covered by gSPC for 93.7% (K = 6) and 93% (K = 20) specificity were 62% and 79% respectively. The first number is similar to that calculated by the supervised classification approach using the C4.5 algorithm, where 58% of sequences were covered with 94% of specificity . Thus the SPC algorithm classification performance is comparable to the state-of-art supervised machine learning classification approach used by . Notice, that the latter method used different sequence attributes, such as PFAM domain and PROSITE composition, and thus explicitly took into consideration the multi-domain organization of proteins. The SPC algorithm, contrary to that, used the sequence similarity only. This result indicates high prediction ability of the annotation using gSPC clustering.
The specificity of the gSPC algorithm using all connections ("all-NN"), 92.8%, is also slightly superior to the results of another supervised approach, the adaptive algorithm for automated annotation  that calculated 90.4%. These results, however, are indicative only since our and previous approaches are different with respect to their types (i.e. supervised and unsupervised ones). In addition similar but not identical datasets were used. For example, the performance of the adaptive algorithm  was tested using 500 probe sequences, randomly chosen from SwissProt while the performance of the C4.5 algorithm was tested using 10-fold cross-validation.
A significant advantage of unsupervised clustering approaches over the supervised ones is the ability of the former methods to detect as yet unobserved relations between proteins. Such results could be used to find new protein families that currently do not have functional annotations or have incomplete or inconsistent ones. The unsupervised methods are also not subject to the risk of overfitting problems. Overfitting impairs the predictive power of supervised approaches for samples that were not represented in the training set [25–27].
The gSPC resulted in higher sensitivity and specificity for all analyzed datasets. The use of a priori sensitivity made it possible to compare results of all methods using a fixed number of samples and made the comparison more straightforward. However, from a practical point of view, particularly for the implementation of annotation pipe lines of complete genomes the specificity and a posteriori sensitivity rather than a priori sensitivity are the most relevant factors of the automatic annotation. Indeed, it is preferable to annotate a smaller number of sequences in automatic mode with a higher accuracy than to annotate more of them with a lower accuracy. The sequences not automatically annotated by the algorithm can be annotated subsequently by complementary approaches or by careful manual analysis of domain structure of the sequences. Any attempt to increase the a priori sensitivity and thus cover a larger number of sequences by lowering specificity may result in an unacceptable performance for annotation purposes.
The SPC algorithm calculates hierarchical tree-structures of clusters for each K value. In our analysis we identified and considered only the leaf clusters. The upper part of the tree was ignored. Such analysis was possible since for the data analyzed in this study the largest number of protein sequences were clustered in leafs and only a few additional sequences could be still clustered considering the whole tree structure. For example, using K = 6 (Table 1 &2, SPC results) only 56 and 627 additional sequences could be clustered for the PDB and SwissProt data sets, respectively. These numbers corresponded to about 1% sequences in each database and only marginally influenced the method performance. The gSPC algorithm, as it was already mentioned in the Method section, clustered such sequences at higher K values. This improved its prediction performance compared to the original SPC method.
The performance of any algorithm analyzing sequence relations depends on the selection of the reference data sets. The composition of such references data sets could bias the performance, since in reality the selection of sequences from known genomes or databases is not a representative random sample from the sequence space. For example, SwissProt and SCOP databases are very often used as "a gold standard" for annotation and classification methods. However, these databases represent a biased selection and do not cover entire genomes. Therefore, analysis of the performance of clustering methods is biased by the composition of the reference data sets. For example, the gSPC decreased missassignments by 2–5% for these two sets. Since the error rate for clustered sequences from these data sets was about 10–15% (Table 5), the relative gain in the performance was 10–40%. In other words, the automatic annotation of sequences clustered with gSPC will make up to 40% fewer erroneous annotations (false positive or false negative annotations) compared to other methods. On the other hand, all three methods returned similar results for the bacterial genomes. For this set gSPC covered about 12% of additional sequences which is very important for genome annotation.
The gSPC method developed is fast. A complete analysis of PDB and SwissProt datasets using all K levels took on moderate PC computer (AMD 1.3 GHz) less than 14 and 120 minutes, respectively. The speed of the original SPC algorithm scales approximately linearly with the number of connections. This number increases approximately as N2 with the number of samples. However, since gSPC uses step-wise clustering, the actual number of samples remaining for clustering using large K values is small. Therefore, the gSPC speed is mainly determined by clustering using small, K = 2–6, values and is in practice approximately linear with the number of samples. In fact, the method will be fast enough to efficiently cluster datasets with millions of sequences; clustering computational requirements is therefore small compared to the computation of the sequence similarity scores. The computational efficiency is clearly an advantage of the gSPC method compared to the TRIBE-MCL. The complexity of the later scales as O(N2) were N is number of non-zero elements in the matrix. Therefore, this method performs slower if an increasing volume of genomic data needs to be processed.
The gSPC calculated with higher accuracy or covered a larger number of sequences than the TRIBE-MCL algorithm for the analyzed datasets. The accuracy of annotation of gSPC for the SwissProt database was comparable to that of supervised methods. Thus it is a useful approach for automatic detection of protein families and annotation of full genomes.
Clustering of sequences using the properties of super paramagnetic systems
SPC performs a hierarchical clustering strategy . The input data for SPC are represented as a distance matrix d ij (specified by the user) between data points i = 1,...,N. This matrix is used to construct a graph, whose vertices are the data points and edges corresponding to the connections between neighboring points. Each two points are considered to be neighbors (and thus have an edge), if they are within K nearest neighbors of each other (K -mutual-neighbor criterion). In the ferromagnetic model each point is considered to have a Potts spin, equivalent to one of q integer values, s = 1,2,...q. Pairs of neighboring points i and j that have the same spin s i = s j are interacting with strength J ij , which is a function of an initial distance matrix d ij .
A Monte Carlo simulation using the Swendsen-Wang method (SW)  is used to determine clusters of points. The simulation starts with assigning a random Potts spin to each data point in the dataset. The neighboring points with identical spins interact with each other. The probability that these points belong to the same SW cluster (i.e., instant cluster resulting from an iteration) is calculated as P ij = (1 - exp(-J ij /T)), where T is some fixed temperature. The points from the same SW cluster receive identical spin (selected by chance) for the next iteration. The iterations are performed until convergence observed using, e.g. autocorrelation time .
For clustering, the strengths of the edges of the network are calculated using the spin-spin correlation function G ij . This function is estimated as the ratio of iterations when points i,j belong to the same SW cluster versus the total number of iterations during the simulation. Notice, that if P ij values determine an instant probability of two points with the same spin to belong to the same instant cluster (i.e., local connectivity), the spin-spin correlation function takes into account the multiple interactions amid points, i.e. global connectivity of the network. Actually the global connectivity determines the probability two points will have the same spin. At very low temperatures even small J ij cause neighboring points to have the same spin, G ij ≅ 1, i.e. the system is in the ferromagnetic state. If the temperature is very high then all J ij <<T and the probability of finding each pair of neighboring points in the same state decreases to the value expected by chance G ij ≅ 1/q. Thus the system is unordered, i.e. it is in the paramagnetic state. If coupling values are equal J ij = J, only two these states are stable. The system has a sharp transition from the ferromagnetic to the paramagnetic state at a certain temperature.
In case of an inhomogeneous system, some magnetic "grains" could be observed. Such clusters of points have strong connections between the members and only weak connections to all points outside the cluster. Each grain has its own transition temperature from the ferromagnetic to the paramagnetic state. At some temperature range, the so-called superparamagnetic state, the system can contain some points in a locally ordered ferromagnetic state and as well as others in a locally unordered paramagnetic state. At the end of simulation, all points that had spin-spin correlation G ij > 0.5 form one cluster. The points that had G ij < 0.5 can be connected to their neighbors with maximal G ij . The results of the analysis at different temperatures are combined and provide a hierarchical clustering of data. A more detailed description of the SPC can be found elsewhere .
Selection of free parameters of SPC algorithm
Free parameters of the SPC algorithm include the number of spins q in the Potts model, the number of nearest neighbors K for the K-mutual-neighbor criterion, and the number of iterations I for the SW algorithm. An increase of q required larger numbers of iterations for convergence but otherwise did not affect the performance of the algorithm . For q = 20 the convergence of SW algorithm was usually observed using I = 1000 iterations for the vast majority of the cases investigated. The same values also resulted in convergence of the algorithm for the datasets used in our study.
As described above, SW simulations start with random spin configuration. After subsequent iterations, the algorithm converges into a stable state that does not depend on the initial configuration. Given convergence at a certain temperature, one can expect that the configuration should be similar to the converged state of the next step in respect to the increased temperature. Thus if SW simulations are started from a converged state calculated for a similar temperature, the convergence to the new state should be faster. Without loss of precision, we used this fact to speed up SW simulations. The configuration after completion of each simulation cycle was saved and used it as the initial one for the next temperature. It was possible to decrease the number of iterations from I = 1000 to I = 200 without affecting the quality of the results.
Global clustering and the gSPC algorithm
The K-mutual-neighbor criterion controls the resolution of the clustering. A low K value takes into consideration only the closest pairs of mutually connected points ("i.e. short connections"). As its values increases, long-range connections are also taken into consideration and they participate in the clustering process. In previous studies this parameter was selected in the range of 5–20 [13, 30]. In a recent study  an algorithm to determine optimal K values was also proposed. The optimal value of K corresponds to the maximum stable partitions characterized by maximum widths of the superparamagnetic domains. The optimization of K is critical in absence of reliable classification information to allow for supervised learning. In this study, reliable protein family assignment for test and training data was available. Thus, it was possible to directly analyze the performance of the SPC algorithm as a function of the parameter K.
Using a constant value for K is reasonable if the data set is rather homogeneous. In this case, the data points have the same connectivity on average. In reality, the problem of sequence based family assignment is complex. Some protein families apparently underlay constraints concerning the variation of their sequence and contain a large number of highly similar sequences while other families contain only few members displaying low sequence similarity. Using K values optimal for conserved families may bias the clustering process and lower the sensitivity to discover the more diverging ones. Thus, it is inappropriate to use a single K -value for clustering of all data.
In this article we introduce a step-wise analysis using different K -values. The data were consequently clustered using K = 2, 6, 20, 64 and K = "all" connections. This choice of K approximately corresponded to a 10-fold increase in the data connectivity that is proportional to K2. For each K we varied temperature from a minimum, T = 10-4, to a maximum, T = 2, values. At the maximum value of T no cluster was observed. As soon as a subset of sequences was clustered using some K, they were removed from any further analysis except one representative sequence. The representative sequence was the one that had the minimal sum of distances to the other sequences in the cluster. These representative sequences as well as all other remaining non-clustered sequences were used for the next round, i.e. larger K, of analysis. Such bottom-up clustering allowed us to cover different types of clusters covering a wide range of homogeneity and to improve the performance of the method as shown in the Result section. In order to distinguish the initial SPC algorithm from the step-wise based approach proposed in this study, we will refer to the latter as the global SPC (gSPC).
Processing of hierarchic data
The SPC algorithm provided a hierarchical clustering of data on each step of the gSPC analysis. Let us consider the clustering procedure of increasing the temperature. For T = 0 all sequences form one cluster that starts to melt and break into subclusters as the temperature increases. Clusters of different sizes can appear and then melt at higher temperatures. Some clusters may contain only 2–3 sequences while other clusters consist of more than a thousand sequences. A number of clusters will lose one or several members for some particular temperatures while others will break into two or more sub-clusters. In case the data set contains only few clusters, one can inspect the cluster size or susceptibility, that is the variance of magnetization of the system as a function of temperature [13, 30]. If different stable sub-phases are determined, e.g. by changes in the cluster sizes, it is possible to select a particular temperature in each region of interest for further analysis. Such a method, however, is unfeasible for the purpose of our study when as many as thousands clusters can be simultaneously observed for each particular temperature. Thus, the cluster composition for each temperature step should be considered for the analysis and some automated approach for the analysis of such results should be performed. In the following paragraphs we describe a method that performs such an analysis in a completely automatic way.
Let us limit a minimal size of any cluster to some parameter C. All points in clusters that have size less than C members were considered as unclustered points or singletons. The singletons did not form separate clusters but belonged to the cluster from which they were generated. At some particular temperature a cluster can break (melt) into a number of subclusters and singletons. However, if and only if these subclusters contained at least two subclusters each having at least C sequences, the parent cluster was considered to be subdivided into subclusters. Otherwise the cluster was considered to be unchanged. A leaf cluster was a cluster that did not form subclusters (i.e., it broke into singletons). The singletons were clustered again by the same procedure for larger K.
A FASTA file containing sequences that were used for clustering was compared to itself using BLAST  for the SwissProt database implemented in PEDANT . For the bacterial genomes the FASTA pair-wise scores for were retrieved from the SIMAP database .
The all-against-all sequence similarities generated were parsed and used as input for both algorithms. The input data for TRIBE-MCL were represented as -log10 (E-value). The input values for the SPC were distance values calculated as -1./log10 (E-value). Pairwise scores with E-value > 0.1 were excluded from the analysis.
List of abbreviations
Protein Extraction, Description and ANalysis Tool
MIPS Functional Catalog
Markov Cluster Algorithm
Super Paramagnetic Clustering
Similarity Matrix of Proteins
We thank the MIPS annotation group, Goar Frishman, Barbara Brauner, Gisela Fobo, Irmtraud Dunge and Corinna Montrone, for annotation of bacterial genomes. We also thank Sabina Tornow who introduced us the SPC algorithm, Louise Riley and anonymous reviewers for valuable remarks. This work was supported in part by grant 031U212C BFAM (BMBF) to HWM and grant TE 380/1-1 (DFG) to IVT/HWM.
- Hegyi H, Gerstein M: The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 1999, 288: 147–164. 10.1006/jmbi.1999.2661View ArticlePubMedGoogle Scholar
- Kriventseva EV, Servant F, Apweiler R: Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters. Nucleic Acids Res 2003, 31: 388–389. 10.1093/nar/gkg035PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29: 22–28. 10.1093/nar/29.1.22PubMed CentralView ArticlePubMedGoogle Scholar
- Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res 2000, 28: 49–55. 10.1093/nar/28.1.49PubMed CentralView ArticlePubMedGoogle Scholar
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30: 1575–1584. 10.1093/nar/30.7.1575PubMed CentralView ArticlePubMedGoogle Scholar
- Enright AJ: Computational Analysis of Protein Function Within Complete Genomes. PhD University of Cambridge 2002.Google Scholar
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–370. 10.1093/nar/gkg095PubMed CentralView ArticlePubMedGoogle Scholar
- Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002, 30: 264–267. 10.1093/nar/30.1.264PubMed CentralView ArticlePubMedGoogle Scholar
- Scholar Google[http://scholar.google.com/]
- Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, et al.: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 2004, 32: D41–44. 10.1093/nar/gkh092PubMed CentralView ArticlePubMedGoogle Scholar
- Getz G, Levine E, Domany E: Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci USA 2000, 97: 12079–12084. 10.1073/pnas.210134797PubMed CentralView ArticlePubMedGoogle Scholar
- Getz G, Gal H, Kela I, Notterman DA, Domany E: Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data. Bioinformatics 2003, 19: 1079–1089. 10.1093/bioinformatics/btf876View ArticlePubMedGoogle Scholar
- Blatt M, Wiseman S, Domany E: Superparamagnetic clustering of data. Physical Review Letters 1996, 76: 3251–3254. 10.1103/PhysRevLett.76.3251View ArticlePubMedGoogle Scholar
- Getz G, Vendruscolo M, Sachs D, Domany E: Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins 2002, 46: 405–415. 10.1002/prot.1176View ArticlePubMedGoogle Scholar
- Tetko IV, Kosarev P, Facius A: Analysis of Ring Finger Domains encoded by the Arabidopsis genome using three different clustering algorithms. In EuroQSAR2002 Designing Drugs and Crop Protectants: processes, problems and solutions. Edited by: Ford M, Livingstone D, Dearden J, Van de Waterbeemd H. Bournemouth, UK: Blackwell Publishing; 2003:384–386.Google Scholar
- Tornow S, Mewes HW: Functional modules by relating protein interaction networks and gene expression. Nucleic Acids Res 2003, 31: 6283–6289. 10.1093/nar/gkg838PubMed CentralView ArticlePubMedGoogle Scholar
- Mewes HW, Albermann K, Bahr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, et al.: Overview of the yeast genome. Nature 1997, 387: 7–65. 10.1038/42755View ArticlePubMedGoogle Scholar
- Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 2004, 32: 5539–5545. 10.1093/nar/gkh894PubMed CentralView ArticlePubMedGoogle Scholar
- Tetko IV, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Fobo G, Ruepp A, Antonov AV, Surmeli D, Mewes HW: MIPS bacterial genomes functional annotation benchmark dataset. Bioinformatics 2005.Google Scholar
- Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, et al.: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29: 37–40. 10.1093/nar/29.1.37PubMed CentralView ArticlePubMedGoogle Scholar
- Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16: 412–424. 10.1093/bioinformatics/16.5.412View ArticlePubMedGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMedGoogle Scholar
- Kretschmann E, Fleischmann W, Apweiler R: Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics 2001, 17: 920–926. 10.1093/bioinformatics/17.10.920View ArticlePubMedGoogle Scholar
- Leontovich AM, Brodsky LI, Drachev VA, Nikolaev VK: Adaptive algorithm of automated annotation. Bioinformatics 2002, 18: 838–844. 10.1093/bioinformatics/18.6.838View ArticlePubMedGoogle Scholar
- Tetko IV, Livingstone DJ, Luik AI: Neural network studies. 1. Comparison of overfitting and overtraining. J Chem Inf Comput Sci 1995, 35: 826–833. 10.1021/ci00027a006View ArticleGoogle Scholar
- Tetko IV, Villa AE, Livingstone DJ: Neural network studies. 2. Variable selection. J Chem Inf Comput Sci 1996, 36: 794–803. 10.1021/ci950204cView ArticlePubMedGoogle Scholar
- Tetko IV: Neural network studies. 4. Introduction to associative neural networks. J Chem Inf Comput Sci 2002, 42: 717–728. 10.1021/ci010379oView ArticlePubMedGoogle Scholar
- Wang S, Swendsen RH: Cluster Monte Carlo Algorithms. Physica A 1990, 167: 565–579.View ArticleGoogle Scholar
- Gould H, Tobochnik J: Overcoming critical slowing down. Computers in Physics 1989, 3: 82–86.View ArticleGoogle Scholar
- Blatt M, Wiseman S, Domany E: Data Clustering Using a Model of Granular Magnet. Neural Computation 1997, 9: 1805–1842.View ArticleGoogle Scholar
- Agrawal H, Domany E: Potts ferromagnets on coexpressed gene networks: identifying maximally stable partitions. Phys Rev Lett 2003, 90: 158102. 10.1103/PhysRevLett.90.158102View ArticlePubMedGoogle Scholar
- Protein Sequence Clustering – TribeMCL[http://www.ebi.ac.uk/research/cgg/tribe]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D: The PEDANT genome database in 2005. Nucleic Acids Res 2005, 33: D308–310. 10.1093/nar/gki019PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.