Recent Hits Acquired by BLAST (ReHAB): A tool to identify new hits in sequence similarity searches
© Whitney et al; licensee BioMed Central Ltd. 2005
Received: 28 September 2004
Accepted: 08 February 2005
Published: 08 February 2005
Sequence similarity searching is a powerful tool to help develop hypotheses in the quest to assign functional, structural and evolutionary information to DNA and protein sequences. As sequence databases continue to grow exponentially, it becomes increasingly important to repeat searches at frequent intervals, and similarity searches retrieve larger and larger sets of results. New and potentially significant results may be buried in a long list of previously obtained sequence hits from past searches.
ReHAB (Recent Hits Acquired from BLAST) is a tool for finding new protein hits in repeated PSI-BLAST searches. ReHAB compares results from PSI-BLAST searches performed with two versions of a protein sequence database and highlights hits that are present only in the updated database. Results are presented in an easily comprehended table, or in a BLAST-like report, using colors to highlight the new hits. ReHAB is designed to handle large numbers of query sequences, such as whole genomes or sets of genomes. Advanced computer skills are not needed to use ReHAB; the graphical interface is simple to use and was designed with the bench biologist in mind.
This software greatly simplifies the problem of evaluating the output of large numbers of protein database searches.
Advances in technology have increased the speed and reduced the cost of DNA sequencing. This has resulted in a dramatic increase in the number of sequences contributed by both large sequencing centres and individual laboratories to sequence databases. Public biological sequence databases are growing at an ever-increasing rate, with 9 million new sequences being added to GenBank from August 2002 to August 2003 alone. Currently, the GenBank database has almost 42 billion nucleotides from over 32 million sequences. The number of whole genome sequences of eukaryotes, prokaryotes and viruses is also increasing rapidly. Accordingly, tools like NCBI BLAST, which search those databases for sequences similar to a given query sequence, return larger and larger sets of results.
Sequence similarity searching is a powerful tool to help develop testable hypotheses in the quest to characterize genes and other DNA sequences and infer structural, functional or evolutionary relationships. Researchers interested in identifying new matches to query sequences, which may be a few genes or even whole genomes, must search through massive amounts of alignment data to retrieve new and interesting matches. In order to keep up with the growing databases, the researcher must submit the same queries periodically. However, the new results, no matter how significant, are often buried in a long list of results that were previously obtained on past searches.
ReHAB (Recent Hits Acquired from BLAST) is a new software package that was developed to address these problems. ReHAB performs PSI-BLAST searches of a protein sequence database and keeps a database of all significant alignments ("hits") obtained; these searches are performed on a regular schedule against updated versions of the sequence database. It then compares the sequences in the new PSI-BLAST result with the ReHAB hits database to identify new matches resulting from recently deposited sequences. The complete ReHAB hits database can then be queried by date using a simple GUI to allow the researcher to easily identify new hits; these are highlighted, and pairwise or multiple alignments can be performed to assess the quality of the match. As well as filtering out results that have been found previously, the ReHAB browser can filter out hits against sequences that are identical or nearly identical to the sequences being submitted as queries (such as the query sequence itself or its orthologs).
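The comparison described above can be illustrated with a minimal sketch. It is not ReHAB's implementation (which is written in Java and backed by MySQL); the function names, the in-memory dictionary and the identifiers such as `orf_001` are all hypothetical. The idea shown is only that each query/subject pair is stamped with the date it was first observed, so that later searches add only unseen pairs and the store can be queried by date.

```python
from datetime import date

def record_hits(hits_db, query_id, subject_ids, search_date):
    """Add unseen (query, subject) pairs to hits_db and return the new subjects.

    hits_db maps (query_id, subject_id) -> date the pair was first observed.
    """
    new_subjects = []
    for subject_id in subject_ids:
        key = (query_id, subject_id)
        if key not in hits_db:
            hits_db[key] = search_date
            new_subjects.append(subject_id)
    return new_subjects

def hits_since(hits_db, query_id, cutoff):
    """Return subjects first seen for query_id on or after the cutoff date."""
    return sorted(
        subject for (query, subject), first_seen in hits_db.items()
        if query == query_id and first_seen >= cutoff
    )

# Two search cycles against successive (hypothetical) database versions.
hits_db = {}
first = record_hits(hits_db, "orf_001", ["subj_A", "subj_B"], date(2004, 8, 1))
second = record_hits(hits_db, "orf_001", ["subj_A", "subj_B", "subj_C"], date(2004, 9, 1))
recent = hits_since(hits_db, "orf_001", date(2004, 8, 15))
```

In this sketch the second cycle reports only `subj_C`, because the other two subjects were already recorded in the first cycle.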
ReHAB is designed to be a practical tool for searching NCBI database updates with large numbers of query sequences. For example, our laboratory uses it with all open reading frames (ORFs) from fully sequenced poxvirus genomes (over 7000 query sequences). As the number of sequenced virus genomes continues to increase, the number of hypothetical ORFs of unknown function also expands. This is particularly true for large viruses like poxviruses, baculoviruses, and herpesviruses that possess many virulence genes that are not part of the core set of genes that define a virus family. There are also numerous core genes for which no known function has yet been identified; for example, of 49 completely conserved protein families in poxviruses, there are 11 with completely unknown function and at least 5 others with only poorly defined function.
Other programs have been previously created to deal with this particular issue, including DBWatcher, SEALS, Swiss-Shop, Sequence Alerting System and BLAST Search Updater. However, WWW-based programs are not well suited to searching with large numbers of query sequences, and there may be concerns with a shut-down of service (as occurred with Sequence Alerting System) or allowing proprietary data out of a secure network. Other programs may be complicated to use, or require users to directly interact with UNIX operating systems. ReHAB is specifically designed for searching with large numbers of query sequences and can support a number of research groups; it also provides a user-friendly graphical interface. The client will run on most major operating systems including Mac OS X, Windows, Linux and Solaris.
ReHAB was implemented for the Java platform to simplify the support of multiple operating systems including Linux, Microsoft Windows, Solaris, and Mac OS X. Users initially access and launch the application (client) from a web page using Java Web Start, which also automatically downloads updated versions as they become available. This ensures users are taking advantage of improvements or added features in the latest software version. Furthermore, coding in Java allows interoperability with existing applications developed in our laboratory, including Base-By-Base. Our choice was reinforced by past successes with the Java platform and Java Web Start for implementation and distribution of programs such as VOCs, VGO, and Base-By-Base.
There are four types of information stored in the ReHAB database: (1) biological sequences and their annotations, both those used as queries in BLAST searches and those which have been returned as hit subjects; (2) information on each query/subject pair (hit), gathered from individual search results and alignment programs (including bit-score, date entered, and percent identity); (3) organizing information, such as which query sequences belong to which organisms; and (4) other caching information, used to speed performance of server-side program functions. To reduce the amount of required storage space, actual alignments are not stored, but are regenerated for presentation when the user selects the specific query sequence or query-target pair to be viewed. Query sequences, which are entered using a simple FASTA-like format that includes additional annotation information in the identifying line, need only be submitted once to ReHAB since they are stored for future search cycles.
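To make the storage layout more concrete, the sketch below shows one way the first three kinds of information could be arranged in relational tables. It is an illustration under stated assumptions, not the actual ReHAB schema: every table and column name is hypothetical, the caching tables are omitted, and SQLite is used in place of MySQL only so that the example is self-contained. Consistent with the text, no alignment is stored, only per-hit scores and the date entered.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from ReHAB.
SCHEMA = """
CREATE TABLE sequence (
    seq_id      TEXT PRIMARY KEY,
    description TEXT,
    residues    TEXT NOT NULL,
    is_query    INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE hit (
    query_id     TEXT NOT NULL REFERENCES sequence(seq_id),
    subject_id   TEXT NOT NULL REFERENCES sequence(seq_id),
    bit_score    REAL NOT NULL,
    pct_identity REAL,
    date_entered TEXT NOT NULL,
    PRIMARY KEY (query_id, subject_id)
);
CREATE TABLE query_annotation (
    query_id TEXT NOT NULL REFERENCES sequence(seq_id),
    name     TEXT NOT NULL,
    value    TEXT NOT NULL
);
"""

def open_store(path=":memory:"):
    """Create the illustrative tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = open_store()
conn.execute("INSERT INTO sequence VALUES ('orf_001', 'example query', 'MKT', 1)")
conn.execute("INSERT INTO sequence VALUES ('subj_A', 'example subject', 'MKS', 0)")
conn.execute("INSERT INTO hit VALUES ('orf_001', 'subj_A', 52.4, 66.7, '2004-09-01')")
conn.execute("INSERT INTO query_annotation VALUES ('orf_001', 'organism', 'example virus')")

# ISO-formatted dates compare correctly as text, so a date filter is a plain comparison.
rows = conn.execute(
    "SELECT subject_id, bit_score FROM hit WHERE date_entered >= '2004-08-15'"
).fetchall()
```

Keeping subject sequences in the same `sequence` table as the queries reflects the statement that both queries and hit subjects are stored with their annotations.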
The work of running PSI-BLAST searches is done in batch mode by the NCBI blastpgp program against a local copy of the NCBI non-redundant (NR) protein database. PSI-BLAST is performed for three iterations without filtering procedures (such as for low-complexity regions). Hits with an E-value less than 0.001 are used to generate the scoring matrix for the subsequent cycle. To increase speed, searches with query sequences that result in no new hits are terminated after the first cycle. Those with new hits scoring below the threshold continue to the third cycle. PSI-BLAST was chosen because it is a more sensitive search method than BLASTP. The searches do not need to be performed on the same machine on which the database or server components are installed. XML output from blastpgp is parsed and relevant information about each hit is stored in the ReHAB database. In addition to scores and identifying information, target sequences are copied into the ReHAB database to ensure that they are available for analysis in the future, even if they are no longer available from NCBI. This is important because, although NCBI does not actually remove sequences from its database, it may change the identifier of a sequence if it is corrected, updated, or merged with another identical entry. Any changes to a pre-existing entry are added to the database, but the entry is not registered as a new hit.
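The search and parsing steps can be sketched as follows. This is a hedged illustration rather than the ReHAB code. The paper does not list the exact blastpgp options used, so the argument list below is an assumption based on the legacy NCBI toolkit flags that correspond to the settings described (`-j` for the number of iterations, `-h` for the inclusion E-value, `-F F` to disable filtering, `-m 7` for XML output). The parser assumes the legacy BLAST XML element names and, as a further assumption, reads only the final iteration and the best-scoring HSP of each hit. The sample document and its identifier are invented.

```python
import xml.etree.ElementTree as ET

def blastpgp_command(query_path, db_name, out_path):
    """Build (but do not run) an assumed blastpgp argument list."""
    return ["blastpgp", "-i", query_path, "-d", db_name,
            "-j", "3", "-h", "0.001", "-F", "F", "-m", "7", "-o", out_path]

def parse_hits(xml_text):
    """Return one record per hit in the final iteration, using its best HSP."""
    root = ET.fromstring(xml_text)
    iterations = root.findall("./BlastOutput_iterations/Iteration")
    if not iterations:
        return []
    records = []
    for hit in iterations[-1].findall("./Iteration_hits/Hit"):
        hsps = hit.findall("./Hit_hsps/Hsp")
        if not hsps:
            continue
        best = max(hsps, key=lambda h: float(h.findtext("Hsp_bit-score")))
        identity = int(best.findtext("Hsp_identity"))
        align_len = int(best.findtext("Hsp_align-len"))
        records.append({
            "subject_id": hit.findtext("Hit_id"),
            "bit_score": float(best.findtext("Hsp_bit-score")),
            "evalue": float(best.findtext("Hsp_evalue")),
            # Percent identity over the aligned length of the best HSP.
            "pct_identity": 100.0 * identity / align_len,
        })
    return records

# Invented, heavily trimmed sample in the legacy BLAST XML layout.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|0000|ref|EXAMPLE_1|</Hit_id>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>120.5</Hsp_bit-score>
              <Hsp_evalue>1e-30</Hsp_evalue>
              <Hsp_identity>50</Hsp_identity>
              <Hsp_align-len>200</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

hits = parse_hits(SAMPLE)
command = blastpgp_command("queries.fasta", "nr", "results.xml")
```

Each parsed record carries the fields that the text says are kept per hit (identifier, bit-score, percent identity); the date entered would be added when the record is written to the database.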
The server component consists of Java RMI classes that provide remote access to local facilities, and a loading program that registers those classes with an RMI Registry installed on the server from which the client will be downloaded. The server requires a system that supports Java 1.4.1 and MySQL 4.0.
The Java Swing client component allows a user to browse the information collected in the database by the back-end program. When the client is downloaded and launched from a website, it connects to the server and database specified in its configuration file.
The client program visually presents summary information about hits added to the database, and allows the user to quickly locate new, relevant hits and the sequences involved. There are five main views available in the client: (1) the management console, which lists the available databases and has options for creating new databases or adding files to existing databases; (2) the Hits Browser window, which lists the organisms for which query sequences have been added to the database and allows users to select filtering and highlighting options; (3) the Hits Summary, which displays the results in a table with highlighting to mark new hits; and (4) an HTML report and (5) a Hits Manager, each of which displays detailed information about retrieved sequences and alignments.
Results and discussion
Finding new hits
Since not all new hits are necessarily significant, results are highlighted in different colors depending on the bit-score. The user can change the default threshold of the minimum bit-score, to show new hits scoring above this cut-off in red and new hits scoring below it in yellow. Since all query sequences that have new hits are highlighted, any that remain unhighlighted do not have new hits. However, unhighlighted queries may have significant hits from previous searches. The "Latest Hit" column indicates this fact: query sequences showing hits only from previous searches show an older hit date, and a bit-score of "0" in the "New Hit Score" column. Unhighlighted sequences with no information in the "Latest Hit" column do not have any hits in the database or they have been filtered out (see below).
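The highlighting rule can be written as a small function. This is a sketch under assumptions, not the client's code: the function name is hypothetical, the row color is taken from the highest-scoring new hit, and a score exactly equal to the cut-off is treated as red, a case the text does not specify.

```python
def highlight_color(new_hit_scores, min_bit_score):
    """Return the highlight color for a query row, or None if it has no new hits.

    new_hit_scores holds the bit-scores of the query's new hits only.
    """
    if not new_hit_scores:
        return None  # no new hits: the row stays unhighlighted
    best = max(new_hit_scores)
    return "red" if best >= min_bit_score else "yellow"

no_new = highlight_color([], 50.0)
strong = highlight_color([30.0, 80.0], 50.0)
weak = highlight_color([30.0], 50.0)
```

A query whose only hits come from earlier searches therefore falls into the first branch, which matches the unhighlighted rows described above.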
Filtering out identical sequences
If a query sequence has already been deposited in the public database, a sequence similarity search will return the query sequence itself, as well as nearly identical sequences that are orthologs of the query. ReHAB can block the highlighting of hit sequences that are also present in the query database when the "Don't Show My Own Sequences" option is selected; such sequences will not be shown or highlighted in the Hits Results window. However, these sequences and their alignments with the query can still be visualized in the HTML Report and Hits Manager windows, thus allowing the user to access all the available information. This feature becomes essential when new poxvirus genomes are added to the public database, since a large fraction of the queries will hit proteins in the new genome and signal a "new hit" report when there may be no other new hits in the database. Although these are clearly high scoring matches, they are expected and therefore must be masked in the analysis if the full value of ReHAB is to be realized.
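A minimal sketch of this masking step follows. The text does not say how ReHAB decides that a hit is "also present in the query database"; the sketch assumes an exact match on the residue string, and the function name, record layout and sequences are hypothetical.

```python
def mask_own_sequences(new_hits, query_sequences):
    """Drop hits whose subject sequence is identical to any stored query sequence.

    new_hits: list of dicts with "subject_id" and "sequence" keys.
    query_sequences: dict mapping query_id -> residue string.
    """
    own = set(query_sequences.values())
    return [hit for hit in new_hits if hit["sequence"] not in own]

queries = {"orf_001": "MKTAYIAK", "orf_002": "MSLLNQ"}
new_hits = [
    {"subject_id": "subj_A", "sequence": "MSLLNQ"},    # same as query orf_002
    {"subject_id": "subj_B", "sequence": "MKTAYLAK"},  # similar but not identical
]
visible = mask_own_sequences(new_hits, queries)
```

Only the summary view is filtered in this way; as the text notes, the masked hits remain available in the detailed report views.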
Browsing by other criteria
In the browser window (Figure 3), the user can choose to browse by the annotation included in each sequence's information line. In the case of our poxvirus sequences, useful annotations are organism name and protein family (as determined in POCs). Selecting an item from the "Group by Annotation" list loads the new category in the list on the left side of the window. This sorting allows the user to quickly find query sequences of particular interest. For example, one may be interested in looking at only sequences from the Ankyrin family. Results can then be viewed and analyzed as described above.
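Grouping by annotation amounts to indexing the query sequences by one annotation field. The sketch below is illustrative only: the record layout, the field names `organism` and `family`, and the example values are assumptions, since the annotation format of the information line is not reproduced here.

```python
from collections import defaultdict

def group_by_annotation(queries, field):
    """Map each value of the chosen annotation field to the matching query ids."""
    groups = defaultdict(list)
    for query in queries:
        value = query["annotations"].get(field, "(none)")
        groups[value].append(query["id"])
    return dict(groups)

queries = [
    {"id": "orf_001", "annotations": {"organism": "virus_X", "family": "Ankyrin"}},
    {"id": "orf_002", "annotations": {"organism": "virus_Y", "family": "Ankyrin"}},
    {"id": "orf_003", "annotations": {"organism": "virus_X"}},
]
by_family = group_by_annotation(queries, "family")
by_organism = group_by_annotation(queries, "organism")
```

Switching the field from `organism` to `family` regroups the same queries, which is the effect of selecting a different item in the "Group by Annotation" list.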
Setting up ReHAB with user-selected sequences
The goal of this project was to build a software package to aid in the identification of new results returned from sequence similarity searches. To this end, we developed ReHAB, a tool that highlights new hits by comparing results from previously run searches to those with a recently updated database. ReHAB allows researchers to query the NR protein database with large numbers of sequences and can highlight, sort, and analyze results in a user-friendly graphical interface. It can also be used to rapidly create multiple alignments with any set of sequences returned by a BLAST search. This enables researchers to recognize new significant sequence matches in the mass of results generated by high throughput database search protocols.
Availability and requirements
Project name: ReHAB
Project home page: http://www.virology.ca/pbr/rehab
Operating systems: All platforms supporting Sun's JRE version 1.4.1 or compatible
Programming languages: Java, SQL
Other requirements: Java 1.4 or higher
License: GNU General Public License
Restrictions for non-academic use: Contact corresponding Author
This work was funded by NIAID/DARPA grant U01 AI48653-02 and Canadian NSERC Strategic Grant STPGP 269665-03. We would like to thank Angelika Ehlers for systems administration and Dr. Rachel Roper for helpful insights.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank: update. Nucleic Acids Res 2004, (32 Database issue):D23–26. 10.1093/nar/gkh045
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Upton C, Slack S, Hunter AL, Ehlers A, Roper RL: Poxvirus orthologous clusters: toward defining the minimum essential poxvirus genome. J Virol 2003, 77(13):7590–7600. 10.1128/JVI.77.13.7590-7600.2003
- Sequence Alerting System [http://www.bork.embl-heidelberg.de/Alerting/]
- Boone M, Upton C: BLAST Search Updater: a notification system for new database matches. Bioinformatics 2000, 16(11):1054–1055. 10.1093/bioinformatics/16.11.1054
- Brodie R, Smith AJ, Roper RL, Tcherepanov V, Upton C: Base-By-Base: Single nucleotide-level analysis of whole viral genome alignments. BMC Bioinformatics 2004, 5(1):96. 10.1186/1471-2105-5-96
- Ehlers A, Osborne J, Slack S, Roper RL, Upton C: Poxvirus Orthologous Clusters (POCs). Bioinformatics 2002, 18(11):1544–1545. 10.1093/bioinformatics/18.11.1544
- Upton C, Hogg D, Perrin D, Boone M, Harris NL: Viral genome organizer: a system for analyzing complete viral genomes. Virus Res 2000, 70(1–2):55–64. 10.1016/S0168-1702(00)00210-0
- Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Software Suite. Trends Genet 2000, 16(6):276–277. 10.1016/S0168-9525(00)02024-2
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequences of two proteins. J Mol Biol 1970, 48(3):443–453. 10.1016/0022-2836(70)90057-4
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195–197. 10.1016/0022-2836(81)90087-5
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.