Table 1 Dataset Statistics.

From: Building multiclass classifiers for remote homology detection and fold recognition

| Statistic | sf95 | sf40 | fd25 | fd40 |
|---|---|---|---|---|
| ASTRAL filtering | 95% | 40% | 25% | 40% |
| Number of Sequences | 2115 | 1119 | 1294 | 1651 |
| Number of Folds | 25 | 25 | 25 | 27 |
| Number of Superfamilies | 47 | 37 | 137 | 158 |
| Avg. Pairwise Similarity | 12.8% | 11.5% | 11.6% | 11.4% |
| Avg. Max. Similarity | 63.5% | 33.9% | 32.2% | 34.3% |
| Avg. Pairwise Similarity (within folds) | 25.6% | 17.9% | 16.7% | 17.4% |
| Avg. Pairwise Similarity (outside folds) | 10.4% | 11.03% | 11.2% | 11.0% |

  1. The percent similarity between two sequences is computed by aligning the pair using SW-GSM with a gap-opening penalty of 5.0 and a gap-extension penalty of 1.0. "Avg. Pairwise Similarity" is the average of all pairwise percent identities. "Avg. Max. Similarity" is the average, over all sequences, of each sequence's maximum pairwise percent identity, i.e., it measures the similarity of a sequence to its most similar sequence. "Avg. Pairwise Similarity (within folds)" and "Avg. Pairwise Similarity (outside folds)" are the averages, over all sequences, of a sequence's average pairwise percent similarity to sequences in the same fold and to sequences in other folds, respectively.
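
The four similarity statistics defined in the footnote can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the pairwise percent similarities have already been computed (the SW-GSM alignment step is not shown) and are stored in a symmetric matrix. The function name `dataset_similarity_stats`, the argument names, and the toy data are hypothetical. The footnote does not say how a sequence with no other member in its fold is treated; the sketch simply skips it in the within-fold average.

```python
from statistics import mean


def dataset_similarity_stats(sim, folds):
    """Summarize a symmetric matrix of pairwise percent similarities.

    sim   -- n x n list of lists; sim[i][j] is the percent similarity of
             sequences i and j (assumed precomputed; diagonal is ignored)
    folds -- list of n fold labels, one per sequence
    """
    n = len(sim)
    max_per_seq, within_per_seq, outside_per_seq = [], [], []

    for i in range(n):
        others = [j for j in range(n) if j != i]
        same = [sim[i][j] for j in others if folds[j] == folds[i]]
        diff = [sim[i][j] for j in others if folds[j] != folds[i]]

        # Similarity of sequence i to its most similar other sequence.
        max_per_seq.append(max(sim[i][j] for j in others))

        # Assumption: sequences with an empty comparison set are skipped.
        if same:
            within_per_seq.append(mean(same))
        if diff:
            outside_per_seq.append(mean(diff))

    # Each unordered pair is counted once.
    all_pairs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]

    return {
        "avg_pairwise": mean(all_pairs),
        "avg_max": mean(max_per_seq),
        "avg_within_folds": mean(within_per_seq),
        "avg_outside_folds": mean(outside_per_seq),
    }


# Toy data (made up for illustration): four sequences in two folds.
toy_sim = [
    [100.0, 30.0, 10.0, 12.0],
    [30.0, 100.0, 14.0, 8.0],
    [10.0, 14.0, 100.0, 20.0],
    [12.0, 8.0, 20.0, 100.0],
]
toy_folds = ["a", "a", "b", "b"]

stats = dataset_similarity_stats(toy_sim, toy_folds)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Note that the within-fold and outside-fold values are averages of per-sequence averages, as the footnote describes, so they can differ from a plain average over all within-fold or cross-fold pairs when fold sizes are unequal.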