- Research Article
- Open Access
Fast batch searching for protein homology based on compression and clustering
© The Author(s) 2017
Received: 18 July 2017
Accepted: 14 November 2017
Published: 21 November 2017
In bioinformatics community, many tasks associate with matching a set of protein query sequences in large sequence datasets. To conduct multiple queries in the database, a common used method is to run BLAST on each original querey or on the concatenated queries. It is inefficient since it doesn’t exploit the common subsequences shared by queries.
We propose a compression and cluster based BLASTP (C2-BLASTP) algorithm to further exploit the joint information among the query sequences and the database. Firstly, the queries and database are compressed in turn by procedures of redundancy analysis, redundancy removal and distinction record. Secondly, the database is clustered according to Hamming distance among the subsequences. To improve the sensitivity and selectivity of sequence alignments, ten groups of reduced amino acid alphabets are used. Following this, the hits finding operator is implemented on the clustered database. Furthermore, an execution database is constructed based on the found potential hits, with the objective of mitigating the effect of increasing scale of the sequence database. Finally, the homology search is performed in the execution database. Experiments on NCBI NR database demonstrate the effectiveness of the proposed C2-BLASTP for batch searching of homology in sequence database. The results are evaluated in terms of homology accuracy, search speed and memory usage.
It can be seen that the C2-BLASTP achieves competitive results as compared with some state-of-the-art methods.
The task of batch searching for protein homology often arise in the field of bioinformatics. As the exponential growth [1, 2] of protein databases, searching for homologs often become ineffective due to the intensive computational efforts involved . For example, in order to investigate the homology of a new protein sequence set, a cross-species protein identification method needs to search millions of sequences in the NR database. Moreover, since the public databases (such as PDB , NR , and SWISSPORT ) are continuously updated, the task of homology search is becoming more computationally expensive and redundant. With the increasingly number of the users and queries being accessible to the public databases, the query tasks are becoming heavy and heavy. Thus effective algorithms that match sets of protein query sequences in large-scale sequence datasets are always in demand.
BLAST  will take a longer time when the scale of query set is getting larger since it evaluates a single query once. It alternatively employs a brute force approach to compare query sequence and database sequence. More specially, the BLAST searches for short fixed-length word pairs in the sequences and then extends them to higher-scoring regions. For each query sequence, the algorithm scans the entire database and compare database sequence with the querying one to find the subsequences. The BLAST maybe conduct reduplicative scans to find common subsequences. Thus, there is an urgent need for a tool that can significantly speed up batch homology searching.
There are many efforts that develop relative techniques for efficient homology searching. MegaBLAST  is a greedy sequence alignment algorithm. It is faster than basic BLAST, but it is less effective for aligning highly similar sequences with larger size. MPBLAST  concatenates queries by grouping them into a single query, with the objective of reducing times of database accessing. BLAST++  transforms a collection of queries into a single virtual query, which guarantees the seed searching process to be performed once for common subsequences. However, it does not take the redundancy of database into consideration, and will get inefficiency when applied in large-scale database. The BLAST+  is developed based on the advanced results from MPBLAST, BLAST++, miBLAST , BLAT . However, its performance is unsatisfactory for batch queries when applied to search on large-scale dataset. MpiBLAST  speeds up homology search by using parallel processing technique on a cluster of machines. CUDA-BLASTP  utilize GPU to speed up searching, however, it is not suitable for supporting large-scale databases due to the limit of memory size. Following the mechanism of CUDA-BLASTP, several homology search tools have been developed, such as RAPSearch  and GHOSTZ . However, these methods require more space to retain relative information of sequences, which incurs excessive memory and storage cost. So, the problem of batch searching for protein homology still remains challenging and there remains much room for researchers to improve their algorithms.
In this paper, we conduct studies with the objective of improving the performance of batch homology search, and a fast compression and clustering based BLASTP (C2-BLASTP) algorithm for large-scale protein homology search is proposed. Firstly, the query set and the database are compressed to reduce sequence redundancy. Then a new database is clustered according to the Hamming distance of similar subsequences. The objective is to minimize the computation time on ungapped extensions. Furthermore, an execution database is constructed, on which the homology search is performed. The execution database is considered as a collection of all the potential homologous sequences.
An effective strategy to improve the efficiency of batch query is to reduce the redundant sequences in query set and the database. The underlying mechanism works by finding representative sequences to express the information throughout the sequence sets. To guarantee the search precision and speed, the representative sequences are expected to be non-redundant as well as to express complete information. The proposed fast batch homology search algorithm (C2-BLASTP) has three major components, i.e., the compression, the clustering, and the batch searching. In the compression process, the database and the query set are compressed by removing the subsequences with high similarity, and leaving the representative subsequences remained. In the clustering process, the subsequences in the compressed database is further grouped based on their similarities, and the potential hits will be obtained. In the batch searching process, a small scale executable database is constructed by the potential homology hits, and the homology search is performed in the execution database. The details above three components are presented in the following subsections.
In the phase of compressing, the associations among potential highly similar subsequences are setup by a mapping between seeds and subsequences, where seed refers to a segment of protein sequence with five amino acids, and subsequence refers to a fraction of protein sequence. The similarity among the subsequences that point to the same seed is evaluated by Needleman-Wunsch . The highly similar subsequences are grouped into one cluster, with one appropriate subsequence being retained as its representation. By applying this mechanism, the data redundancies can be reduced. Meanwhile, the query sequence and database can be compressed.
An initial key-entry pair map structure is constructed. Each key in the map is a segment of protein sequence with five amino acids, and it is also called a seed. The attributes of the key include an index number in the database (also referred as sequence number), a starting amino acid position, and a link to the next subsequence. By scanning the protein sequence from left to right, a key is created using every five amino acids. Figure 1 shows an example of the key entry pair map structure.
Each sequence in the query set or the protein database is compared with the existing keys in the current map. By scanning the input protein sequence from left to right, the keys are compared with every five successive amino acids. If the compared segment matches one of the existing keys, the Needle man Wunsch algorithm is carried out, the segment will be truncated starting from the current position, and will be connected with other segments that are linked by the matched key. Otherwise, a new key will be added, and its corresponding entry attributes will be added to the current map.
Redundant segments in sequences are compressed. Similarity can be computed according to the alignment result using BLOSUM62 . When the similarity is higher than a given threshold (80%), the referred subsequence is considered to be redundant. So the subsequence is deleted, meanwhile, a new link to the current key is added and the difference between the two subsequences is recorded in a special script.
A final non-redundant segment pool is created. The new database consists of non-redundant segments of protein sequence and the corresponding sequence information.
By conducting the compression process, the redundancy in the query set and the protein database can be reduced. However, since the compressed protein database is still large as the fast growing of protein sequences, the online running of BLASTP is still time consuming. Moreover, the traditional BLASTP takes much time extending alignments without gaps because of the large number of seeds (including 3 amino acids). The C2-BLASTP further conduct clustering on the compressed database. Following this, the process of hits finding is implemented on the representative seed of each cluster.
How to determine the key length is crucial in key finding task. In fact, the short subsequences of the same length tend to appear with different frequencies in the database because of the composition bias in biology. It has been validated that the keys with 6-9 amino acids tend to appear with higher efficiency . So, the lengths of keys are automatically selected in the range of 6-9 amino acids based on the sum of the match scores of the short subsequences. The match score is obtained by the BLOSUM62 score matrix and is taken by the highest score in each group of amino acids. To avoid insignificant short segments, the threshold T is taken empirically with value 39. When the sum of match scores for short subsequences exceeds T, the subsequence is considered as a key. For example, the subsequence ‘YKWVN’ is not used as a key because its score sum is less than 39, while ‘YKWVNK’ is used as a key because its score higher than 39. If a key is obtained, then a key-entry map is created and extended by following a similar procedure in compression process. Finally, a complete key-entry map (Map1) for all of the keys can be obtained.
Next, seeds can be generated from keys. The seeds are composed of ten residues, with the first five residues being extended forward from the starting point of the key, and the remaining residues being taken from the first five residues of the key. Finally, the seeds produced from the same key are clustered according to Hamming distance, respectively. The seeds will be group into one cluster if their similarity exceeds a given threshold (90%). Each cluster has one representative seed, with other seeds being linked to. Meanwhile, two association diagrams are created. The first diagram is the seed-entry map for the representative seed (Map2), and its entry includes the cluster ID and the location of representative seed. The other diagram is the clustering map (Map3). As shown in Fig. 4 c, the diagram describes the cluster ID and the location of its cluster member. The above procedure accelerates the search speed since it groups similar subsequences together.
The clustered database is constructed offline by implementing the operators of compression and clustering. It needs to be updated regularly as the database expanses. For given query sequences, the objectives lie with finding enough information for homology from the clustered database, and creating a smaller scale execution database. The execution database is a collection of all the potential homologous sequences with which the homology search can be performed.
Since hits associate with potential homologous sequences, how to find hits from the clustered database plays an important role in constructing the execution database. Hits are the set of results obtained by searching the clustered database using compressed query set as index. To compare query sequences with the clustered database that is described by three maps in “Clustering”section, we construct the seed-entry map for query set and keep their format being consistent. More specifically, the query sequences are firstly re-expressed by the reduced amino acid alphabets, and then every ten adjacent residues are taken as a seed in the query set directly. Thereafter, we compare each seed in query set with the representative seeds in Map2. If they are identical, the corresponding original fragments (non-reduced amino acid alphabets) can be recovered according to their entries in maps. So, the similarity between the fragment of query sequences and the cluster representative can be calculated. If the similarity exceeds a given threshold (80%), all the members in the cluster can be obtained by the cluster ID. Then we conduct gapped and ungapped extensions to obtain hits.
Where S q , S m and S r are the query seed, the cluster member, and the cluster representative, respectively. d(S 1,S 2) is the distance between seed S 1 and seed S 2. In particular, the maximum value of d(S r ,S m ) is 1 because the cluster threshold T c is taken as 90%. So, the lower bound of distance between S q and S m can be obtained. If the lower bound is less than or equal to the distance calculated from similarity threshold T s , then the query seed may be highly similar to the member seed. Therefore, we conduct gapped and ungapped extension to get hits.
Results and discussion
Experimental datasets and settings
In this section, experiments are conducted to evaluate the performance of the proposed C2-BLASTP. In the experiments, the NR database built on June 2013 is taken as benchmarks. The database has 26.7 million protein sequences, including a total of 9.3 billion amino acids. We randomly select a certain number of sequences from the Saccharomyces Genome Database (SGD) and the ENV _NR Database as query sequences. The SGD contains the proteomes of 21 strains of yeast . The ENV _NR contains some translations from the ENV.NT (nucleotide) database, and the ENV.NT contains DNA sequences from the environment directly. The organization of the datasets indicates the varieties of their organisms. The proteins from environmental projects are presented in either the NR or the ENV _NR database, depending upon whether that sequence has been identified as a particular organism (NR), or the organism is unknown (ENV _NR). All the experiments are carried out on a work station with dual 4-core Intel Xeon E-2609 processor, 32 GB memory and using Centos Linux.
Existing algorithms for comparison
BLASTP (BLAST+ version 2.2.31): BLASTP (Basic Local Alignment Search Tool for Protein) can be used to infer functional and evolutionary relationships among sequences. The executing process include word matching, ungapped extension, and gapped extension. The algorithm can be used to compare protein sequences with sequence databases and to calculate the statistical significance of matches, and it also can be used to infer functional and evolutionary relationships among sequences.
CaBLASTP  (Version 1.0.3): CaBLASTP introduces compression strategy and achieves a faster speed than BLAST by searching in the compressed database. It firstly searches the protein homology in a coarse database where the redundant subsequences are removed, and then uses the obtained initial results to search the original database for similar sequences.
GHOSTZ  (Version 1.0.0): GHOSTZ uses the strategy of clustering database subsequence and filters out the non-representative seeds within these clusters to minimize the computation time spent on ungapped extensions.
Effects of compression
Comparison results using different compression threshold for the C2-BLASTP
T t for query set
Comparison with other methods and analysis
Comparative results of the C2-BLASTP with other algorithms
The advantage of the GHOSTZ lies in performing seed search in the offline process of database construction. And the representative seeds further improve the search speed. However, the GHOSTZ adopts the reduced amino acid alphabets in the original database, so the more underlying matched seeds will result in the larger number of alignments. When the query set is relatively small, the number of seeds in BLASTP is not so large. In this case, GHOSTZ does not have advantage over other algorithms in terms of speed. Besides, GHOSTZ need more memory requirements during the process of creating clustered database. The C2-BLASTP compress the original database offline at one time, and further the representative seeds are obtained by clustering. Due to such advantages, it outperforms other algorithm with the small-scale query set (<200 sequences) in terms of speed. With the increase of the query sequences, C2-BLASTP spends much time in reconstructing execution database.
Analysis of memory and disk cost
Costs of memory and disk for appended sequence set using different algorithms
Memory size (GB)
Disk size (GB)
The rapid growth of the protein sequences in databases makes batch homology search challenging. The proposed C2-BLASTP fully exploits the joint information among the query sequences and the database. To recude redundancies, the queries set and database are compressed based on the Needleman-Wunsch algorithm. And the database is further clustered to reduce the computation time incurred by ungapped extensions. Finally, the homology search is conducted in a constructed execution database, which is considered as a collection of all the potential homologous sequences. In conclusion, C2-BLASTP can be implemented as an extension of the NCBI BLAST+, and can be easily interfaced with other programs that use protein BLAST as search tools. Numerical Experiments on NCBI NR database show that the C2-BLASTP for batch searching of homology in sequence database is effective. C2-BLASTP can also be integrated with different BLAST tools to improve the search speed by replacing BLASTP in the fine BLASTP process.
The current C2-BLASTP will be extended by adapting with high performance hardware such as GPU to accelerate searching speed.
As the current C2-BLASTP only realized the high speed batch searching of protein sequence, it will be extended to a wider varieties of toolbox similar to BLAST.
The authors would like to thank the anonymous reviewers and editors for their useful comments and suggestions that helped to improve this paper.
The preliminary work has been published as an abstract at International Conference on Bioinformatics and Biomedicine, 2016 (BIBM 2016). This work is an extension of the work in BIBM 2016 from the perspectives of both methodology and experiments.
This work was supported by the National Natural Science Foundation of China (61572104, 61103146, 61402076, 61502072), the Fundamental Research Funds for the Central Universities (DUT17JC04), and the Project of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172017K03).
Availability of data and materials
The authors declare that the data and material used in this paper are available, and the datasets can be submitted to Springer Nature as part of publisher’s Data Support Services. The source code and datasets are available on https://github.com/liangsunatdlut/C2BLASTP/.
HWG made substantive intellectual contributions to this paper. LS made contributions to conception, design and analysis of this paper. JHY made contributions to the experimental implementation of this paper. All authors read and approve the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Kahn SD. On the future of genomic data. Science. 2011; 331:728–9.View ArticlePubMedGoogle Scholar
- Daniels NM, Gallant A, Peng J, Cowen LJ, Baym M, Berger B. Compressive genomics for protein databases. Bioinformatics. 2013; 29(13):39–45.View ArticleGoogle Scholar
- Zepeda G, Reyna C, Fu Y, Rodriguez L, Isabel C. Novel protein interactions with an actin homolog (mreb) of helicobacter pylori determined by bacterial two hybrid system. Microbiol Res. 2017; 201:39–45.View ArticleGoogle Scholar
- Nat Struct Biol. 2003; 10:980. doi: 10.1038/nsb1203-980.
- Pruitt KD, Tatusova T, Maglott DR. Ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005; 35:61–5.View ArticleGoogle Scholar
- The UniProt Consortium. Uniprot: the universal protein knowledgebase. Nucleic Acids Res. 2017; D158–D169(D1):158–69.View ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25(17):3389–402.View ArticlePubMedPubMed CentralGoogle Scholar
- Morgulis A, Coulouris G, Raytselis Y. Database indexing for production megablast searches. Bioinformatics. 2008; 24(16):1757–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Korf I, Gish W. Mpblast: improved blast performance with multiplexed queries. Bioinformatics. 2000; 16(11):1052–3.View ArticlePubMedGoogle Scholar
- Wang H, Oi BC, Tan KL. Blast++: Blasting queries in batches. Bioinformatics. 2003; 19(17):2323–4.View ArticlePubMedGoogle Scholar
- Camacho C, Coulouris G, Avagyan V. Blast+: architecture and applications. BMC Bioinformatics. 2009; 10(1):1–9.View ArticleGoogle Scholar
- Kim YJ, Boyd A, Tthey BD. miblast: scalable evaluation of a batch of nucleotide sequence queries with blast. Nucleic Acids Res. 2005; 33(13):4335–44.View ArticlePubMedPubMed CentralGoogle Scholar
- Kent WJ. Blat-the blast-like alignment tool. Genome Res. 2002; 12(4):656–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Darling A, Carey L. The design, implementation, and evaluation of mpiblast. In: The 4th International Conference on Linux Clusters, San Jose. San Jose: 2003. p. 656–64.Google Scholar
- Liu W, Schmidt B, Muller-Wittig W. Cuda-blastp: accelerating blastp on cuda-enabled graphics hardware. EEE/ACM Trans Comput Biol Bioinforma. 2011; 8(6):1678–84.View ArticleGoogle Scholar
- Ye Y, Choi JH, Tang H. Rapsearch: a fast protein similarity search tool for short reads. BMC Bioinformatics. 2011; 12(1):159–63.View ArticlePubMedPubMed CentralGoogle Scholar
- Suzuki S, Kakuta M, Ishida T, Akiyama Y. Faster sequence homology searches by clustering subsequences. Bioinformatics. 2015; 31(8):1183–90.View ArticlePubMedGoogle Scholar
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48:443–53.View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG. Amino acid substitution matrices. Adv Protein Chem. 2000; 54:73–97.View ArticlePubMedGoogle Scholar
- Cherry JM, Hong EL, Amundsen C. Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res. 2011; 40(D1):700–5.View ArticleGoogle Scholar
- Daniels NM, Gallant A, Peng J, Cowen LJ, Baym M, Berger B. Compressive genomics for protein databases. Bioinformatics. 2013; 29(13):283–90.View ArticleGoogle Scholar