Fig. 1 | BMC Bioinformatics

Fig. 1

From: In silico approach to designing rational metagenomic libraries for functional studies

Classification of GOS proteins into families based on existing HMMs from PFAM and TIGRFAMs and on de novo MCL clustering.

a 72% (4,436,387) of the protein sequences in the GOS dataset could be assigned to families based on existing HMMs obtained from PFAM and TIGRFAMs. The remaining 28% (1,687,008) were distributed into MCL-based classes. Of the sequences in these classes, 680,484 (11% of the total GOS protein sequences) belonged to classes considered bona fide families based on class size, diversity, and the number of complete sequences they contain.

b To generate MCL-based de novo clusters similar to the clusters based on existing HMMs, the sequences from 2,356 randomly chosen HMM-based families were subjected to Markov Clustering at the indicated inflation parameter values. Each HMM-based family was then compared to the resulting MCL-based clusters by calculating the Jaccard similarity coefficient (Jaccard index). The MCL-based cluster with the highest Jaccard index was considered the cluster corresponding to the HMM-based family. The heatmap shows the Jaccard indices, color-coded according to the legend, and is sorted by the phylogenetic diversity of the HMM-based families. At an inflation parameter of 1.1, the MCL-based clusters showed the highest similarity to the HMM-based families.

c Taxonomic distribution of MCL-based families. HMMs generated from these families were compared to the RefSeq database, and the matching proteins were classified as being of viral, prokaryotic, or eukaryotic origin. More than 1,000 families are specific to the GOS dataset.
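The matching step described for panel b can be illustrated with a minimal sketch. It assumes each HMM-based family and each MCL-based cluster is represented as a set of sequence identifiers. The function names, the identifiers, and the toy data are hypothetical and are not taken from the article, which does not describe its implementation.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def best_matching_cluster(
    family: Set[str], clusters: Dict[str, Set[str]]
) -> Tuple[str, float]:
    """Return the id and Jaccard index of the cluster most similar to the family."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    best_id = max(clusters, key=lambda cid: jaccard(family, clusters[cid]))
    return best_id, jaccard(family, clusters[best_id])


# Hypothetical toy data: one HMM-based family and three MCL-based clusters.
family = {"p1", "p2", "p3", "p4"}
clusters = {
    "c1": {"p1", "p2", "p3"},  # shares 3 of 4 members with the family
    "c2": {"p4", "p5"},        # shares 1 member, adds 1 outsider
    "c3": {"p6"},              # no overlap
}

best_id, score = best_matching_cluster(family, clusters)
print(best_id, score)
```

Repeating this for every family at each inflation value would give one row of Jaccard indices per family, which is the kind of matrix a heatmap such as panel b displays.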
