Figure 1 | BMC Bioinformatics

Figure 1

From: Using distances between Top-n-gram and residue pairs for protein remote homology detection

The process of generating the distance-based Top-1-gram feature vector. A protein S is submitted to PSI-BLAST, which aligns it against a non-redundant database, and a frequency profile is calculated from the resulting multiple sequence alignment. In each column of the frequency profile, the frequencies of the 20 standard amino acids are sorted in descending order; the Top-1-gram is the most frequent amino acid in that column. Combining the Top-1-grams in sequence order represents S as a sequence of Top-1-grams, S'. With the distance threshold d_MAX set to 2, the feature vector combines the Top-1-gram pairs at distances 0, 1, and 2.
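
The steps in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy profile, and the dictionary-based profile format are hypothetical, and the sketch makes three assumptions: ties in a column are broken alphabetically, a pair at distance d means positions i and i + d in S', and distance 0 pairs each position with itself (so it reduces to counting single Top-1-grams). Counts are left unnormalized.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def top_1_grams(profile):
    """Return S': the most frequent amino acid of each profile column.

    `profile` is assumed to be a list of dicts (one per column) mapping
    amino acid letters to frequencies; missing letters count as 0.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    return "".join(
        max(AMINO_ACIDS, key=lambda aa: column.get(aa, 0.0))
        for column in profile
    )


def distance_pair_counts(s_prime, d_max):
    """Count Top-1-gram pairs (a, b, d) for every distance d in 0..d_max.

    A pair at distance d is taken from positions i and i + d of S'.
    At d = 0 each position is paired with itself (assumption).
    """
    counts = Counter()
    for d in range(d_max + 1):
        for i in range(len(s_prime) - d):
            counts[(s_prime[i], s_prime[i + d], d)] += 1
    return counts


# Toy frequency profile with four columns (made-up values).
toy_profile = [
    {"A": 0.6, "G": 0.4},
    {"L": 0.7, "A": 0.3},
    {"A": 0.5, "L": 0.3, "G": 0.2},
    {"G": 0.9, "A": 0.1},
]

s_prime = top_1_grams(toy_profile)
features = distance_pair_counts(s_prime, d_max=2)
```

Storing the counts sparsely in a `Counter` keyed by (first Top-1-gram, second Top-1-gram, distance) avoids fixing a vector layout; a dense feature vector could be obtained by enumerating the keys in any fixed order.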
