Optimal neighborhood indexing for protein similarity search
 Pierre Peterlongo^{1},
 Laurent Noé^{2, 3},
 Dominique Lavenier^{4},
 Van Hoa Nguyen^{1},
 Gregory Kucherov^{2, 3} and
 Mathieu Giraud^{2, 3}
https://doi.org/10.1186/1471-2105-9-534
© Peterlongo et al; licensee BioMed Central Ltd. 2008
Received: 30 May 2008
Accepted: 16 December 2008
Published: 16 December 2008
Abstract
Background
Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet.
Results
The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood leads to a 35% reduction in the memory involved in the process, without sacrificing either the quality of results or the computational time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website http://bioinfo.lifl.fr/reblosum.
Conclusion
We propose a practical index size reduction of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.
Background
One fundamental task in bioinformatics concerns large-scale comparisons between proteins or families of proteins. It often constitutes the first step before further investigations. A typical comparison, for example, is to query a database with a newly discovered sequence. Observed similarities suggest a putative common biological function and direct further studies.
In this paper, we focus on massive protein sequence comparisons: a large database is iteratively compared with relatively short queries (such as newly sequenced data). A possible approach is to use the exact dynamic programming method [1]. For a given similarity model, this method provides optimal alignments within a quadratic computation time. Some optimizations achieve a subquadratic complexity [2], but the computation time remains prohibitive for large scale comparisons. Thus, in practice, the full dynamic programming approach is applied to comparison of short sequences.
A successful family of similarity search methods is provided by seed-based heuristics, starting with Fasta [3] and Blast [4] and including specific methods for protein similarities such as Blastp [5]. Seed-based heuristics were recently enhanced by advanced seeding tools like the spaced seeds used in PatternHunter [6] or Yass [7] (see [8] for a recent survey). The authors of this paper also worked on combining advanced seed techniques with reconfigurable architectures [9].

Stage 1: search for keys that occur in both strings,

Stage 2: extension of these matching keys with an ungapped alignment, keeping only the alignments with a score greater than a given threshold $\mathcal{T}$,

Stage 3: full dynamic programming algorithm, applied only to successfully extended matching keys.
In this work, we consider comparisons between a set of protein queries and a large protein database of N amino acids. A common usage of Blast is to index the queries, and then to scan the full database at runtime. If the sizes of the query and the database allow it, fully indexing both leads to advantageous results [10]. In our work, we applied the approach used, e.g., in Blat [11], where the database is indexed once and each query is successively processed.

α is the number of bits for coding a character (amino acid), and

L is the length of each neighborhood.
In common experiments, ⌈log_{2} N⌉ is between 20 and 40, αL is between 20 and 200, hence r is between 2 and 21. It is worth mentioning that the ⌈log_{2} N⌉ value is often raised to a more practical 32 or 64 bits, reducing the ratio r even more. Storing neighborhoods then becomes relevant as memory prices fall. For instance, modern technology makes it possible to get gigabytes of Flash memory in a personal computer for a few hundred dollars. It is thus interesting to exploit this storage space as much as possible. It can be used for treating larger databases, but also, as in this work, for speeding up widely used applications.
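The ranges quoted above are consistent with reading r as the size of an index line that stores an offset plus two neighborhoods, divided by the size of a line that stores the offset alone. The expression below is a reconstruction from those stated bounds (and from the neighborhood factor 2αL used later in the paper), not a formula quoted from the text:

```latex
r = \frac{\lceil \log_2 N \rceil + 2\alpha L}{\lceil \log_2 N \rceil}
  = 1 + \frac{2\alpha L}{\lceil \log_2 N \rceil},
\qquad
r_{\min} = 1 + \frac{2 \cdot 20}{40} = 2,
\qquad
r_{\max} = 1 + \frac{2 \cdot 200}{20} = 21 .
```

Under this reading, padding the offset to 32 or 64 bits enlarges the denominator, which is why the ratio drops further in that case.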
However, the index size still remains the main limitation. In this paper, we study how the size of a large neighborhood index can be reduced while preserving the result quality. For this purpose, we worked on reducing the ratio r as much as possible. One way of doing this is to reduce the factor αL. We propose to simultaneously increase the neighborhood length (L) and reduce the alphabet size (2^{α}). We limit the alphabet size by partitioning amino acids into groups. This reduces α by encoding neighborhood characters in fewer than the 5 bits required for coding 20 amino acids. Partitioning the amino acids into 16 groups makes it possible to encode each group using 4 bits, and partitioning into 8, 4 or 2 groups makes it possible to encode each group using 3, 2 and 1 bits respectively. All these reduced alphabets are tested in this paper.
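As an illustration of the encoding idea, the following minimal Python sketch maps amino acids to group codes and packs a neighborhood into an integer using α bits per character. The 8-group partition is the one that appears in the Σ_{20} × Σ_{8} matrix of this paper; the function names and the example sequence are hypothetical and only serve the illustration.

```python
# Partition into 8 groups, as used for the Σ_20 × Σ_8 matrix in this paper.
GROUPS_8 = ["CFYW", "MLIV", "G", "P", "ATS", "NH", "QED", "RK"]


def build_code_table(groups):
    """Map each amino acid letter to the index of its group."""
    return {aa: code for code, group in enumerate(groups) for aa in group}


def bits_per_character(groups):
    """Number of bits (alpha) needed to code one group index."""
    return max(1, (len(groups) - 1).bit_length())


def pack_neighborhood(sequence, groups):
    """Pack a neighborhood into one integer, alpha bits per character.

    Returns the packed integer and the number of bits used (alpha * L).
    """
    table = build_code_table(groups)
    alpha = bits_per_character(groups)
    packed = 0
    for aa in sequence:
        packed = (packed << alpha) | table[aa]
    return packed, alpha * len(sequence)


# Hypothetical neighborhood of length L = 14 over the 8-group alphabet.
packed, n_bits = pack_neighborhood("MKVLAAGIVGLSCA", GROUPS_8)
print(n_bits)  # 14 characters * 3 bits = 42 bits
```

With the full 20-letter alphabet the same neighborhood would need 5 bits per character, which is the saving that the alphabet reduction trades against a longer neighborhood.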
Grouping amino acids was studied in several papers [13–16]. Groups can rely on amino acid physico-chemical properties or on a statistical analysis of alignments. For example, the authors of [13] computed correlation coefficients between pairs of amino acids based on the BLOSUM50 matrix and used a greedy algorithm to merge them. A branch-and-bound algorithm for partitioning the amino acids was proposed in [14]. Those papers mainly deal with the construction of reduced alphabets, but none of them studies how the alphabet reduction affects the sensitivity of similarity search, or undertakes a quantitative analysis of the trade-off between search sensitivity and index size for those alphabets. This raises the following problem, which is solved in this paper: can reduced alphabets allow one to decrease the factor αL while preserving the quality of similarity search results?
Results and discussion
Stage two algorithm
Algorithm 1 Stage 2

Ensure: reports if a matching key occurrence potentially belongs to an alignment
Require: query neighborhoods (left_{query} and right_{query})
1: get database neighborhoods left_{db} and right_{db}
2: result_{left} ← 0; highest_{left} ← 0
3: result_{right} ← 0; highest_{right} ← 0
4: for i from 1 to L do
5:     result_{left} ← result_{left} + subst_score(left_{db}[i], left_{query}[i])
6:     if result_{left} > highest_{left} then highest_{left} ← result_{left} endif
7: end for
8: for i from 1 to L do
9:     result_{right} ← result_{right} + subst_score(right_{db}[i], right_{query}[i])
10:     if result_{right} > highest_{right} then highest_{right} ← result_{right} endif
11: end for
12: if highest_{left} + highest_{right} ≥ threshold $\mathcal{T}$ then return true endif
13: return false
Memory for neighborhood storage for different alphabets with adapted neighborhood lengths
indexed neighborhood alphabet  bits per character (α)  neighborhood length (L)  total bits per index line (2αL)  relative gain compared to Σ_{20} (1 – 2αL/110)

Σ_{20}  5  11  110  0%
Σ_{16}  4  12  96  13%  
Σ_{8}  3  14  84  24%  
Σ_{4}  2  19  76  31%  
Σ_{2}  1  32  64  42% 
The main idea is to represent the neighborhoods of keys stored in the index (see Figure 3) over a reduced alphabet. Consequently, at Stage 2 of the similarity search, amino acid sequences are compared with sequences over the reduced alphabet. By an alignment over Σ × Σ', we understand an alignment between a sequence over Σ and a sequence over Σ'. Thus, in this paper we will consider alignments over Σ_{20} × Σ_{20}, Σ_{20} × Σ_{16}, Σ_{20} × Σ_{8}, Σ_{20} × Σ_{4} and Σ_{20} × Σ_{2}.
Stage 2 algorithm and quality
A detailed description of Stage 2 is given in Algorithm 1 (Table 1). Query and database neighborhoods of a matching key (detected during Stage 1) are compared character by character over L positions. During this comparison, which uses substitution score matrices (lines 5 and 9), the highest scores for the left and right neighborhoods are kept (lines 6 and 10). If the sum of the highest scores reaches the threshold $\mathcal{T}$, the alignment is kept for Stage 3 (line 12), otherwise it is rejected (line 13). Note that in the offset indexing case, a random memory access is performed in order to retrieve the neighborhoods left_{db} and right_{db} (line 1). This is not the case for the neighborhood indexing, as the neighborhoods are stored directly in the index.
The quality of Stage 2 is measured by a trade-off between its sensitivity (ability to extend true alignments) and selectivity (ability to filter out spurious seed hits). Computation of those values is described in the Methods section.
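Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the neighborhoods are assumed to be stored so that position 1 is the character closest to the key, and the toy match/mismatch scoring stands in for a real substitution score matrix.

```python
def best_prefix_score(db_neighborhood, query_neighborhood, subst_score):
    """Highest running score reached while scanning one neighborhood."""
    running, highest = 0, 0
    for db_char, query_char in zip(db_neighborhood, query_neighborhood):
        running += subst_score(db_char, query_char)
        highest = max(highest, running)
    return highest


def stage2_filter(left_db, right_db, left_query, right_query,
                  subst_score, threshold):
    """Return True if the matching key may belong to an alignment."""
    left = best_prefix_score(left_db, left_query, subst_score)
    right = best_prefix_score(right_db, right_query, subst_score)
    return left + right >= threshold


def toy_score(a, b):
    """Hypothetical scoring: +1 for identical characters, -1 otherwise."""
    return 1 if a == b else -1


# Left neighborhoods match on 3 positions, right ones on the first 2 only.
print(stage2_filter("ACD", "EFG", "ACD", "EFH", toy_score, threshold=5))  # True
print(stage2_filter("ACD", "EFG", "ACD", "EFH", toy_score, threshold=6))  # False
```

Because the highest score starts at 0, a neighborhood that only accumulates negative scores contributes nothing rather than penalizing the other side.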
Increasing the threshold $\mathcal{T}$ or decreasing the neighborhood length L makes Stage 2 more selective but less sensitive (faster execution at the price of worse quality results) while decreasing $\mathcal{T}$ or increasing L increases the sensitivity and decreases the selectivity (better quality results at the price of a slower execution).
Reducing the index size by 35% without loss of quality
In practice, this result enables a reduction of the index size without any sacrifice in running time or in result quality. Table 2 shows the memory requirements for different alphabets. We obtain a practical reduction of 42% of the factor αL using the reduced alphabet Σ_{2} instead of Σ_{20}. The ratio r on the overall index size is then reduced by 35%.
Rectangular substitution score matrices
RE BLOSUM 62 matrix for alphabet Σ_{20} × Σ_{16}
C  F  Y  W  M  L  I  V  G  P  A  T  S  N  H  Q  E  D  R  K  

[C]  9  2  2  2  1  1  1  1  3  3  0  1  1  3  3  3  4  3  3  3 
[FY]  2  5  5  1  0  0  1  1  3  3  2  2  2  3  0  2  3  3  2  3 
[W]  2  1  2  11  1  2  3  3  2  4  3  2  3  4  2  2  3  4  3  3 
[ML]  1  0  1  2  3  4  1  1  3  3  1  1  2  3  2  2  3  3  2  2 
[IV]  1  0  1  3  1  1  3  3  3  3  1  0  2  3  3  2  3  3  3  2 
[G]  3  3  3  2  3  4  4  3  6  2  0  2  0  0  2  2  2  1  2  2 
[P]  3  4  3  4  2  3  3  2  2  7  1  1  1  2  2  1  1  1  2  1 
[A]  0  2  2  3  1  1  1  0  0  1  4  0  1  2  2  1  1  2  1  1 
[T]  1  2  2  2  1  1  1  0  2  1  0  5  1  0  2  1  1  1  1  1 
[S]  1  2  2  3  1  2  2  2  0  1  1  1  4  1  1  0  0  0  1  0 
[N]  3  3  2  4  2  3  3  3  0  2  2  0  1  6  1  0  0  1  0  0 
[H]  3  1  2  2  2  3  3  3  2  2  2  2  1  1  8  0  0  1  0  1 
[QE]  3  3  2  2  1  3  3  2  2  1  1  1  0  0  0  4  4  1  0  1 
[D]  3  3  3  4  3  4  3  3  1  1  2  1  0  1  1  0  2  6  2  1 
[R]  3  3  2  3  1  2  3  3  2  2  1  1  1  0  0  1  0  2  5  2 
[K]  3  3  2  3  1  2  3  2  2  1  1  1  0  0  1  1  1  1  2  5 
RE BLOSUM 62 matrix for alphabet Σ_{20} × Σ_{8}
C  F  Y  W  M  L  I  V  G  P  A  T  S  N  H  Q  E  D  R  K  

[CFYW]  4  4  4  5  1  0  1  1  3  3  2  2  2  3  0  2  3  3  3  3 
[MLIV]  1  0  1  2  2  3  3  2  3  3  1  1  2  3  3  2  3  3  2  2 
[G]  3  3  3  2  3  4  4  3  6  2  0  2  0  0  2  2  2  1  2  2 
[P]  3  4  3  4  2  3  3  2  2  7  1  1  1  2  2  1  1  1  2  1 
[ATS]  1  2  2  3  1  2  1  1  0  1  2  2  2  0  1  1  1  1  1  1 
[NH]  3  2  0  3  2  3  3  3  1  2  2  1  0  5  5  0  0  1  0  0 
[QED]  3  3  2  3  2  3  3  3  2  1  1  1  0  0  0  3  3  4  0  0 
[RK]  3  3  2  3  1  2  3  2  2  1  1  1  0  0  0  1  0  1  4  4 
RE BLOSUM 62 matrix for alphabet Σ_{20} × Σ_{4}
C  F  Y  W  M  L  I  V  G  P  A  T  S  N  H  Q  E  D  R  K  

[CFYW]  6  6  6  7  1  1  1  2  4  5  2  3  3  4  1  4  4  5  4  4 
[MLIV]  2  0  2  3  3  4  4  4  5  4  1  1  3  5  4  3  4  5  4  3 
[GPATS]  2  4  3  4  2  3  3  2  4  4  3  2  2  1  2  1  1  2  2  1 
[NHQEDRK]  5  4  2  4  3  4  5  4  2  2  2  1  0  3  3  3  3  3  3  3 
RE BLOSUM 62 matrix for alphabet Σ_{20} × Σ_{2}
C  F  Y  W  M  L  I  V  G  P  A  T  S  N  H  Q  E  D  R  K  

[CFYWMLIV]  4  4  3  4  3  4  4  3  6  6  2  2  4  6  4  4  6  7  5  5 
[GPATSNHQEDRK]  4  5  4  6  3  5  5  4  2  2  1  1  2  2  1  2  2  2  2  2 
Such matrices can be applied in any method reducing the amino acid alphabets by residue grouping. As one may be interested in using any other pair of alphabets, we additionally propose a web interface [17]. This web interface computes RE BLOSUM matrices for other alphabets listed in [16] and for any custom alphabets provided by the user.
Parameters for e-value computation
The e-value, or expected value, provides the expected number of alignments with a given score, when comparing a text T and a query Q of lengths |T| and |Q| respectively. Local alignment methods like Blast sort results by increasing e-value, thus reflecting their decreasing significance. In the Blast algorithm, the e-value of an alignment is obtained by e-value = K · |Q| · |T| · e^{−λs}, where s is the score of the alignment and λ and K are the parameters of the Gumbel law.
Gumbel law parameters λ and K for different alphabets, obtained with the corresponding RE BLOSUM score matrices.
alphabets  λ  K 

Σ_{20} × Σ_{20}  0.320  0.139 
Σ_{20} × Σ_{16}  0.333  0.143 
Σ_{20} × Σ_{8}  0.223  0.142 
Σ_{20} × Σ_{4}  0.212  0.128 
Σ_{20} × Σ_{2}  0.161  0.101 
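The e-value formula of this subsection, in its standard Karlin-Altschul form with a negative exponent, can be evaluated directly with the parameters of the table above. The sketch below also inverts the formula to obtain the smallest integer score that meets a target e-value; this inversion is one plausible way of turning a user-specified e-value into a score cutoff and is an assumption, not the authors' stated procedure for the threshold $\mathcal{T}$. The function names, dictionary keys and the query length of 300 are hypothetical.

```python
import math

# (lambda, K) pairs copied from the Gumbel parameter table of this paper.
GUMBEL = {
    "20x20": (0.320, 0.139),
    "20x16": (0.333, 0.143),
    "20x8": (0.223, 0.142),
    "20x4": (0.212, 0.128),
    "20x2": (0.161, 0.101),
}


def e_value(score, query_len, text_len, lam, k):
    """Expected number of alignments with at least this score."""
    return k * query_len * text_len * math.exp(-lam * score)


def min_score_for_e_value(target, query_len, text_len, lam, k):
    """Smallest integer score whose e-value is at most `target`."""
    return math.ceil(math.log(k * query_len * text_len / target) / lam)


lam, k = GUMBEL["20x20"]
query_len, text_len = 300, 12_700_507  # hypothetical query, database size from the experiments
cutoff = min_score_for_e_value(1e-3, query_len, text_len, lam, k)
print(cutoff, e_value(cutoff, query_len, text_len, lam, k))
```

A smaller λ, as obtained for the most reduced alphabets, means that a higher raw score is needed to reach the same e-value.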
Experimental validation
In a model where the Stage 2 alignments are ungapped, using reduced alphabets and alignments on longer neighborhoods can however affect the result quality. Indeed, the longer the neighborhoods are, the higher the chance of meeting a gap in the sequences. More generally, the probability distributions used in theoretical sensitivity and selectivity computations do not truly reflect the nature of the biological sequences.
We thus validated our approach with large-scale tests on biological sequences. We set the database to be the hard-masked human chromosome 21 (UCSC Release hg18) translated according to the six possible reading frames. The query set was a set of seven archaea and bacteria proteomes derived from a study of mitochondrial diseases. This set was selected for its interest in detecting potential insertions of mitochondrial genes in the human genome. Moreover, comparing such distant species represents one of the hardest application cases for our approach. More typical homology searches on closer sequences are easier, and tests restricted to them could have hidden potential issues with our approach.
The database contained 12 700 507 amino acids whereas the query was composed of 5 321 439 amino acids. Using the ssearch method [20], 650 alignments were obtained between the database and the query (maximal e-value: 10^{-3}). This set of exhaustive optimum alignments was sufficient to validate our method in comparison with results obtained using different alphabets. The seed used in Stage 1 was a subset seed (see [21]), as in [9]. For the neighborhood indexing, we indexed the database using each of the alphabets Σ_{20}, Σ_{16}, Σ_{8}, Σ_{4} and Σ_{2}. We selected the neighborhood length to have a theoretical sensitivity close to 0.95 and a theoretical selectivity close to 0.01. Theoretical sensitivity and selectivity are defined according to the distributions presented in the Methods section.
Practical results for different alphabets – Quality estimations
alphabets  number of positions validating Stage 1 and Stage 2  practical selectivity  number of detected alignments  practical sensitivity 

Σ_{20} × Σ_{20}  2.14 * 10^{6}  1.35 * 10^{-3}  650 (all)  1 
Σ_{20} × Σ_{16}  1.39 * 10^{6}  0.88 * 10^{-3}  650 (all)  1 
Σ_{20} × Σ_{16}  0.98 * 10^{6}  0.62 * 10^{-3}  650 (all)  1 
Σ_{20} × Σ_{8}  0.62 * 10^{6}  0.39 * 10^{-3}  650 (all)  1 
Σ_{20} × Σ_{4}  3.14 * 10^{6}  1.98 * 10^{-3}  650 (all)  1 
Σ_{20} × Σ_{2}  2.93 * 10^{6}  1.85 * 10^{-3}  650 (all)  1 
Practical results for different alphabets – Memory requirements
alphabet  α  L  S_{neighborhood}  r 

Σ_{20}  5  11  1.70 * 10^{9} bits = 212 MBytes  5.58 
Σ_{16}  4  12  1.52 * 10^{9} bits = 190 MBytes  5.00 
Σ_{8}  3  14  1.37 * 10^{9} bits = 171 MBytes  4.50 
Σ_{4}  2  19  1.27 * 10^{9} bits = 159 MBytes  4.17 
Σ_{2}  1  32  1.12 * 10^{9} bits = 140 MBytes  3.67 
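The figures in the table above can be reproduced under two assumptions that the text does not state explicitly: one index line per database position, and an offset coded on ⌈log_{2} N⌉ = 24 bits for N = 12 700 507. The following Python sketch, with hypothetical names, recomputes the index size in bits and the ratio r for each alphabet.

```python
import math

N = 12_700_507  # database size in amino acids, from the experiments above

# alphabet name -> (bits per character alpha, neighborhood length L)
ALPHABETS = {
    "S20": (5, 11),
    "S16": (4, 12),
    "S8": (3, 14),
    "S4": (2, 19),
    "S2": (1, 32),
}


def index_figures(n, alpha, length):
    """Return (total index size in bits, ratio r) for one alphabet.

    Assumes one index line per database position, each holding an offset
    of ceil(log2 n) bits plus two neighborhoods of alpha * length bits.
    """
    offset_bits = math.ceil(math.log2(n))
    line_bits = offset_bits + 2 * alpha * length
    return n * line_bits, line_bits / offset_bits


for name, (alpha, length) in ALPHABETS.items():
    total_bits, ratio = index_figures(N, alpha, length)
    print(f"{name}: {total_bits / 1e9:.2f} * 10^9 bits, r = {ratio:.2f}")
```

Under these assumptions the line size goes from 134 bits for Σ_{20} to 88 bits for Σ_{2}, a saving of roughly one third, in line with the overall reduction reported in this paper.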
As shown in Table 8, each of the reduced alphabets yields full practical sensitivity, as all 650 alignments are found in each test. Moreover, the practical selectivity, close to 10^{-3}, is here better than the theoretical one (0.01).
Conclusion
We proposed a method for reducing the index size when storing neighborhoods of seed keys in protein databases. This approach is based on reducing the alphabet of indexed data while using a longer neighborhood. We save 35% of the index size without any change in result quality, assuming an ungapped alignment model. We provided optimal lengths for selected alphabets.
Furthermore, the proposed method requires unusual substitution score matrices that are called RE BLOSUM, for rectangular BLOSUM matrices. These matrices provide substitution scores between letters from different alphabets. We extended the computation of traditional BLOSUM matrices in order to compute RE BLOSUM matrices, and adapted the computation of λ and K parameters for e-value estimation to reduced alphabets. We provided RE BLOSUM matrices and their corresponding λ and K values for selected alphabets. Other matrices and parameters can be obtained from the website [17].
Methods
In this section, we describe the methods we used to compute the sensitivity and selectivity of similarity search on reduced alphabets as well as the neighborhood length. We further describe the computation of RE BLOSUM substitution score matrices and of the e-value parameters. Moreover, we explain how the threshold $\mathcal{T}$ is computed at Stage 2 depending on the e-value specified by the user. Finally, we describe how we estimated the time gain of the neighborhood indexing over the offset indexing.
Selectivity and sensitivity computation
Note that here we focus on the behavior of Stage 2 and do not take into consideration the sensitivity/selectivity of Stage 1. In particular, in the above fractions we consider only alignments that extend a hit presumably reported at Stage 1.
The sensitivity and the selectivity of Stage 2 rely on three parameters: the alphabet choice, the neighborhood length, and the score threshold $\mathcal{T}$. Given these three parameters, we applied a dynamic programming algorithm to compute the probability for the filter to retain an alignment drawn according to a given amino acid pair distribution. Applied to distributions of "true" and "random" alignments (foreground and background distributions, respectively), the algorithm gives a theoretical estimation of the sensitivity and the selectivity of the filter. The two distributions were the Bernoulli models (namely the expected and the observed probabilities, see below), obtained with the BLOSUM programs on the BLOCKS protein database when processing the BLOSUM62 matrix.
In our Algorithm 1, two neighborhoods (left and right) are processed. We thus consider the sum of two maximal scores, reached in the left and right neighborhoods. The probability that this sum reaches a given threshold $\mathcal{T}$ at least once is computed as follows. First, we compute the probability for each neighborhood independently to reach any given maximal score s (s ≥ 0) within the neighborhood length. Then, these two independent discrete distributions are combined to compute the $\mathcal{T}$ threshold requirement.
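The two-step computation described above can be sketched as follows. This is a minimal illustration under a Bernoulli (independent positions) model, with hypothetical function names and a toy score distribution; with a foreground score distribution the result estimates the sensitivity, and with a background distribution it estimates the selectivity.

```python
from collections import defaultdict


def max_prefix_score_distribution(score_probs, length):
    """Distribution of the highest running score over `length` positions.

    score_probs maps a substitution score to its probability at one
    position. The highest score starts at 0, as in Algorithm 1.
    """
    # State: (running score, highest score so far) -> probability.
    states = {(0, 0): 1.0}
    for _ in range(length):
        next_states = defaultdict(float)
        for (running, highest), prob in states.items():
            for score, score_prob in score_probs.items():
                new_running = running + score
                new_highest = max(highest, new_running)
                next_states[(new_running, new_highest)] += prob * score_prob
        states = next_states
    distribution = defaultdict(float)
    for (_, highest), prob in states.items():
        distribution[highest] += prob
    return dict(distribution)


def pass_probability(score_probs, length, threshold):
    """Probability that the left and right highest scores sum to >= threshold."""
    dist = max_prefix_score_distribution(score_probs, length)
    return sum(p * q
               for a, p in dist.items()
               for b, q in dist.items()
               if a + b >= threshold)


# Toy distribution (hypothetical): score +1 or -1 with equal probability.
toy = {1: 0.5, -1: 0.5}
print(max_prefix_score_distribution(toy, 2))
print(pass_probability(toy, 2, 3))
```

The left and right neighborhoods are treated as independent and identically distributed here, which is why a single distribution is combined with itself.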
For our experiments, we calibrated the neighborhood lengths to have a sensitivity close to 0.95 and a selectivity close to 0.01, and computed the related threshold values (available on the RE BLOSUM website).
Computing RE BLOSUM matrices
where p_{i} · p_{j} is the expected probability of aligning i against j, and q_{i,j} is the observed probability of the same event in a subset of alignments of the BLOCKS database that have at least x percent identity. (Note that the computation of q_{i,j} takes into account different contributions provided by alignments with different identity levels.)
where p_{I} · p_{J} is the expected frequency of aligning any amino acid from group I ⊆ Σ against any amino acid from group J ⊆ Σ', and q_{I,J} is the observed frequency of the same event in a subset of alignments of the BLOCKS database that have at least x percent identity. A recent paper [25] identified flaws in the original BLOSUM implementation, but showed that a corrected program does not improve (and in some cases even decreases) the quality of the results. Therefore, we did not take the proposed modifications into account.
The website [17] proposes a selection of RE BLOSUM matrices for several alphabets, as well as an interface to compute RE BLOSUM matrices for any alphabet and identity level specified by the user.
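The log-odds construction described in this subsection can be sketched as follows. Several points are assumptions made for the illustration rather than details confirmed by the text: group frequencies are obtained by summing letter-level frequencies, q is treated as a joint probability over ordered pairs, and scores are rounded in half-bit units (a scale of 2 on base-2 logarithms, as in the usual BLOSUM62 scaling). The three-letter alphabet and its frequencies are toy values, not BLOCKS data, and the function name is hypothetical.

```python
import math


def grouped_log_odds(q, p, row_groups, col_groups, scale=2.0):
    """Rounded log-odds scores between row groups and column groups.

    q[(i, j)]: observed probability of aligning letter i with letter j.
    p[i]: background probability of letter i.
    Each group is a string of letters; singletons give ordinary rows.
    """
    matrix = {}
    for group_i in row_groups:
        for group_j in col_groups:
            q_ij = sum(q[(i, j)] for i in group_i for j in group_j)
            p_i = sum(p[i] for i in group_i)
            p_j = sum(p[j] for j in group_j)
            matrix[(group_i, group_j)] = round(scale * math.log2(q_ij / (p_i * p_j)))
    return matrix


# Toy three-letter alphabet (hypothetical frequencies whose marginals match p).
p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {
    ("a", "a"): 0.40, ("a", "b"): 0.05, ("a", "c"): 0.05,
    ("b", "a"): 0.05, ("b", "b"): 0.15, ("b", "c"): 0.05,
    ("c", "a"): 0.05, ("c", "b"): 0.05, ("c", "c"): 0.15,
}

# Rectangular matrix: full alphabet on one side, reduced alphabet {a, bc} on the other.
rectangular = grouped_log_odds(q, p, ["a", "b", "c"], ["a", "bc"])
for key, score in sorted(rectangular.items()):
    print(key, score)
```

When every group is a singleton on both sides, the same function yields an ordinary square log-odds matrix, which is the sense in which the rectangular matrices extend the usual ones.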
Prototype for estimating the time gain of neighborhood indexing over offset indexing
To compare the execution time of offset indexing and neighborhood indexing, a C prototype was created. In the case of the offset indexing, the index stores the positions of all seeds in a single integer array. For each seed key, a pointer provides the first occurrence in this array. In the case of the neighborhood indexing, the index uses a single array of structures instead of an integer array. For each key occurrence, the structure contains the key position together with the two neighborhoods.
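The two index layouts can be contrasted with a small Python sketch. It is only an illustration of the data layout: the prototype described above is written in C and uses flat arrays, whereas this sketch uses dictionaries of lists, contiguous keys instead of subset seeds, and left neighborhoods stored reversed so that they read outward from the key. All names and these choices are assumptions.

```python
from collections import defaultdict


def build_indexes(database, key_len, nb_len):
    """Build both index layouts for contiguous keys of length key_len.

    Positions too close to the borders to have full neighborhoods are skipped.
    """
    offset_index = defaultdict(list)        # key -> [position, ...]
    neighborhood_index = defaultdict(list)  # key -> [(position, left, right), ...]
    for pos in range(nb_len, len(database) - key_len - nb_len + 1):
        key = database[pos:pos + key_len]
        left = database[pos - nb_len:pos][::-1]  # reversed: read outward from the key
        right = database[pos + key_len:pos + key_len + nb_len]
        offset_index[key].append(pos)
        neighborhood_index[key].append((pos, left, right))
    return offset_index, neighborhood_index


offsets, neighborhoods = build_indexes("ACDEFGHIKL", key_len=2, nb_len=3)
print(offsets["EF"])        # positions only: neighborhoods must be fetched from the database
print(neighborhoods["EF"])  # position plus both neighborhoods, available without another lookup
```

With the offset layout, Stage 2 has to go back to the database at each reported position, which is the random memory access that the neighborhood layout avoids.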
Tests reported in Figure 4 were run on a 2 GHz PC with an AMD Opteron processor. The database size was selected so that the index fits into the main memory (4 GB) but not into the L1/L2 cache (1 MB). In those tests, the neighborhood indexing performs almost twice as fast as the offset indexing.
Declarations
Acknowledgements
This work was supported by the INRIA ARC grant "Flash" (Seed Optimization and Indexing of Genomic Databases).
References
1. Smith T, Waterman M: Identification of common molecular subsequences. Journal of Molecular Biology 1981, 147: 195–197.
2. Crochemore M, Landau G, Ziv-Ukelson M: A Subquadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices. Symposium On Discrete Algorithms (SODA 02) 2002, 679–688.
3. Lipman D, Pearson W: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85: 2444–2448. 10.1073/pnas.85.8.2444
4. Altschul S, Gish W, Miller W, Myers W, Lipman D: Basic local alignment search tool. Journal of Molecular Biology 1990, 215(3):403–410.
5. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
6. Li M, Ma B, Kisman D, Tromp J: PatternHunter II: Highly Sensitive and Fast Homology Search. Journal of Bioinformatics and Computational Biology 2004, 2(3):417–439. [(early version in GIW 2003)]. 10.1142/S0219720004000661
7. Noé L, Kucherov G: YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Research 2005, 33: W540–W543. 10.1093/nar/gki478
8. Brown DG: A survey of seeding for sequence alignment. In Bioinformatics Algorithms: Techniques and Applications. Edited by Mandoiu I, Zelikovsky A. Wiley-Interscience; 2008:126–152.
9. Peterlongo P, Noé L, Lavenier D, Georges G, Jacques J, Kucherov G, Giraud M: Protein similarity search with subset seeds on a dedicated reconfigurable hardware. Parallel Biocomputing Conference (PBC 07), Volume 4967 of Lecture Notes in Computer Science (LNCS) 2007.
10. Nguyen VH, Lavenier D: Speeding up Subset Seed Algorithm for Intensive Protein Sequence Comparison. 6th International Conference on research, innovation and vision for the future 2008.
11. Kent WJ: BLAT - the BLAST-like alignment tool. Genome Res 2002, 12(4):656–664.
12. Hennessy JL, Patterson DA: Computer Architecture, A Quantitative Approach. Morgan Kaufmann; 2006.
13. Murphy L, Wallqvist A, Ronald L: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Engineering 2000, 13(3):149–152. 10.1093/protein/13.3.149
14. Cannata N, Toppo S, Romualdi C, Valle G: Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics 2002, 18(8):1102–1108. 10.1093/bioinformatics/18.8.1102
15. Li T, Fan K, Wang J, Wang W: Reduction of protein sequence complexity by residue grouping. Protein Engineering 2003, 16(5):323–330. 10.1093/protein/gzg044
16. Edgar R: Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research 2004, 32: 380–385. 10.1093/nar/gkh180
17. ReBLOSUM: Rectangular BLOSUM Matrices [http://bioinfo.lifl.fr/reblosum/]
18. Henikoff J, Henikoff S: Amino Acid Substitution Matrices from Protein Blocks. Proc Natl Acad Sci USA 1992, 89: 10915–10919. 10.1073/pnas.89.22.10915
19. Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 1990, 87(6):2264–2268. 10.1073/pnas.87.6.2264
20. Lipman D, Pearson W: Rapid and Sensitive Protein Similarity Searches. Science 1985, 227: 1435–1441. 10.1126/science.2983426
21. Roytberg M, Gambin A, Noé L, Lasota S, Furletova E, Szczurek E, Kucherov G: Efficient seeding techniques for protein similarity search. Bioinformatics Research and Development, Proceedings of the 2nd International Conference BIRD 2008, Vienna (Austria), July 7–9, 2008, of Communications in Computer and Information Science, Springer Verlag 2008, 13: 466–478.
22. Henikoff S, Henikoff J: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915–10919. 10.1073/pnas.89.22.10915
23. Henikoff S, Henikoff J: Automated assembly of protein blocks for database searching. Nucleic Acids Res 1991, 19(23):6565–6572. 10.1093/nar/19.23.6565
24. Blosum database [http://sci.cnb.uam.es/Services/ftp/databases/blocks/unix/blosum/]
25. Styczynski MP, Jensen KL, Rigoutsos I, Stephanopoulos G: BLOSUM62 miscalculations improve search performance. Nat Biotech 2008, 26(3):274–275. 10.1038/nbt0308-274
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.