Table 5 Numbers of "words" of the five building blocks for remote homology detection

From: A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

Building blocks      N-grams   patterns   motifs   binary profiles   Top-n-gram-combine
Numbers of "words"   8000      8000       3231     1087              420
  1. The "combine" suffix refers to Top-n-grams that combine Top-1-grams and Top-2-grams. N-grams are the set of all possible subsequences of a fixed length of 3, so the total number of N-gram words is 8000 (20³) [32]. Patterns are extracted by TEIRESIAS [38]; in total, 71009 patterns are extracted [32]. Through χ² selection [34], 8000 patterns are selected as the characteristic words [32]. The MEME/MAST system [40] is used to discover motifs and search databases; in total, 3231 motifs are extracted [32]. The optimized probability threshold of 0.13 is used to convert the protein sequence frequency profiles into binary profiles, yielding 1087 words [28]. Top-1-grams and Top-2-grams have 20 and 400 words, respectively, so the total number of Top-n-gram-combine words is 420 (20 + 400).
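
The vocabulary sizes quoted in the footnote for N-grams and Top-n-gram-combine can be checked with a short sketch. It assumes an alphabet of the 20 standard amino acids and assumes that Top-n-gram words are counted as 20^n, which matches the 20 and 400 stated above; the names `AMINO_ACIDS` and `ngram_vocabulary` are illustrative and do not come from the paper. The counts for patterns, motifs and binary profiles are empirical and cannot be derived this way.

```python
from itertools import product

# Assumption: the alphabet is the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_vocabulary(n):
    """Enumerate every possible string of length n over the alphabet."""
    return ["".join(letters) for letters in product(AMINO_ACIDS, repeat=n)]


# N-grams of fixed length 3: 20^3 possible words.
trigram_count = len(ngram_vocabulary(3))

# Top-1-grams and Top-2-grams, counted here as 20^1 and 20^2 words (assumption).
top1_count = len(AMINO_ACIDS) ** 1
top2_count = len(AMINO_ACIDS) ** 2
combined_count = top1_count + top2_count

print(trigram_count, top1_count, top2_count, combined_count)
```

Enumerating the words rather than only computing 20³ makes it easy to inspect the vocabulary itself, at the cost of holding 8000 short strings in memory.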