Table 1 Dataset Statistics.

From: Building multiclass classifiers for remote homology detection and fold recognition

Statistic                                 sf95    sf40    fd25    fd40
ASTRAL filtering                          95%     40%     25%     40%
Number of Sequences                       2115    1119    1294    1651
Number of Folds                           25      25      25      27
Number of Superfamilies                   47      37      137     158
Avg. Pairwise Similarity                  12.8%   11.5%   11.6%   11.4%
Avg. Max. Similarity                      63.5%   33.9%   32.2%   34.3%
Avg. Pairwise Similarity (within folds)   25.6%   17.9%   16.7%   17.4%
Avg. Pairwise Similarity (outside folds)  10.4%   11.03%  11.2%   11.0%
  1. The percent similarity between two sequences is computed by aligning the pair with SW-GSM, using a gap-opening penalty of 5.0 and a gap-extension penalty of 1.0. "Avg. Pairwise Similarity" is the average of all the pairwise percent identities. "Avg. Max. Similarity" is the average, over all sequences, of each sequence's maximum pairwise percent identity, i.e., it measures the similarity of a sequence to its most similar sequence. "Avg. Pairwise Similarity (within folds)" and "Avg. Pairwise Similarity (outside folds)" are the averages, over all sequences, of a sequence's average pairwise percent similarity to sequences in the same fold and to sequences outside that fold, respectively.
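
The four similarity statistics defined in the footnote can be illustrated with a minimal sketch. It assumes the pairwise percent similarities have already been computed (the SW-GSM alignment step is not reproduced here) and are stored in a symmetric matrix. The function name `dataset_similarity_stats` and its arguments `sim` and `folds` are hypothetical, and skipping sequences that have no other member in their fold is an assumption, not something the table states.

```python
import numpy as np

def dataset_similarity_stats(sim, folds):
    """Summarize a pairwise similarity matrix (illustrative sketch).

    sim   : (n, n) symmetric array of pairwise percent similarities;
            the diagonal (self-similarity) is ignored.
    folds : length-n sequence of fold labels, one per sequence.
    """
    sim = np.asarray(sim, dtype=float)
    folds = np.asarray(folds)
    n = sim.shape[0]

    off_diag = ~np.eye(n, dtype=bool)              # exclude self-comparisons
    same_fold = (folds[:, None] == folds[None, :]) & off_diag
    other_fold = folds[:, None] != folds[None, :]

    def mean_of_row_means(mask):
        # Average similarity per sequence over the masked partners,
        # then average over sequences that have at least one partner.
        # Assumption: sequences with no partner in the mask are skipped.
        counts = mask.sum(axis=1)
        sums = np.where(mask, sim, 0.0).sum(axis=1)
        valid = counts > 0
        return float((sums[valid] / counts[valid]).mean())

    return {
        # Average over all distinct pairs
        "avg_pairwise": float(sim[off_diag].mean()),
        # Average of each sequence's similarity to its closest sequence
        "avg_max": float(np.where(off_diag, sim, -np.inf).max(axis=1).mean()),
        "avg_within_fold": mean_of_row_means(same_fold),
        "avg_outside_fold": mean_of_row_means(other_fold),
    }

# Toy example: four sequences in two folds (values are made up).
toy_sim = [
    [100, 30, 10, 12],
    [30, 100, 14, 8],
    [10, 14, 100, 40],
    [12, 8, 40, 100],
]
toy_folds = ["a", "a", "b", "b"]
print(dataset_similarity_stats(toy_sim, toy_folds))
```

Averaging per sequence first and then across sequences follows the wording of the footnote for the within-fold and outside-fold rows; averaging directly over all within-fold pairs would weight large folds more heavily and could give different values.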