Java GUI for InterProScan (JIPS): A tool to help process multiple InterProScans and perform ortholog analysis
© Syed and Upton; licensee BioMed Central Ltd. 2006
Received: 17 August 2006
Accepted: 20 October 2006
Published: 20 October 2006
Recent, rapid growth in the quantity of available genomic data has generated many protein sequences that are not yet biochemically classified. Thus, the prediction of biochemical function based on structural motifs is an important task in post-genomic analysis. The InterPro databases are a major resource for protein function information. For optimal results, these databases should be searched at regular intervals, since they are frequently updated.
We describe here a new program JIPS (Java GUI for InterProScan), a tool for tracking and viewing results obtained from repeated InterProScan searches. JIPS stores matches (in a local database) obtained from InterProScan searches performed with multiple versions of the InterPro database and highlights hits that have been added since the last search of the InterPro database. Results are displayed in an easy-to-use tabular format. JIPS also contains tools to assist with ortholog-based comparative studies of protein signatures.
JIPS is an efficient tool for performing repeated InterProScans on large batches of protein sequences, tracking and viewing search results, and mining the collected data.
Recent advances in DNA sequencing technology have led to an unprecedented and rapid accumulation of genomic data [1–3]. Although this huge amount of data is immensely useful for a variety of comparative -omics studies, it also presents significant challenges in the areas of data management and analysis, as databases need to be designed to accommodate future growth. Comparative analysis tools must also be able to handle increasing amounts of data; the processing power of computers may be increasing, but such analyses are often computationally intensive. Another aspect of using these tools that is sometimes forgotten is that analyses such as BLAST similarity searches [4] or InterPro motif scans [5–8] are not one-shot experiments. Since the sequence/motif databases they use are continually changing, results quickly become obsolete and thus searches must be repeated at frequent intervals.
Running such analyses manually on a regular basis may not present a problem to a researcher who is only interested in a few specific genes. However, larger-scale query sets (e.g. an entire gene family) may contain so many sequences that the process becomes a highly tedious chore. One serious consequence of this is that such analyses are often performed only sporadically, and thus significant new database matches are not discovered in a timely fashion. We designed the program Recent Hits Acquired from BLAST (ReHAB) [9] to automate PSI-BLAST [4] searches and help mine the results. ReHAB has the following features: 1) it automatically performs regular PSI-BLAST searches on large numbers of query proteins; 2) it allows the user to browse the search results via a simple interface; 3) it highlights new database hits, distinguishing them from the large volume of unimportant PSI-BLAST output; and 4) it assists with further investigation of the results (comparing orthologs and creating multiple sequence alignments (MSA) for selected hits).
Along with similarity searches such as BLAST and FASTA [10], one of the most useful methods of predicting protein function is examining a sequence for the presence of signature motifs. Most genomics researchers are probably familiar with the PROSITE [11] and Pfam [12] databases. InterPro [5–7] is a searchable super-database that integrates a variety of signature-based databases and can be queried with a sequence via the InterProScan tool. Because new motifs are discovered and old ones refined, the InterPro database is updated regularly, and past searches should be repeated with each database release. Searches should be performed using all available members of a particular protein family, as this increases the overall chance of matching a database protein signature. InterProScan can be operated via a web interface [13], and although a locally installed version can run large numbers of proteins in batch mode, reviewing the results can be extremely tedious and time-consuming. In addition, the results must be viewed individually or parsed by a separate computer program.
These considerations prompted us to design a new program, Java GUI for InterProScan (JIPS), to aid in the analysis of protein sequences by InterProScan and thus alleviate these problems. Specifications for the software included: 1) an interface to simplify batch runs and analyses; 2) a mechanism to flag new signature matches for the user; 3) tools to assist in ortholog comparisons and further analysis; 4) the ability to export signatures as annotations to the query protein. JIPS stores the query sequences together with the results produced by searching the InterPro database in local JIPS databases.
JIPS was implemented using Java to support multiple operating systems (including Mac OS X, Linux, Solaris and Microsoft Windows), and to ensure compatibility with other Java-based Viral Bioinformatics Resource Center [14] applications, including the Virus Orthologous Clusters database (VOCs) [14, 15], and Base-By-Base (BBB) [14, 16]. Users initially access and launch the application (JIPS client) from a web page using Java Web Start (JWS). A local application is then created on the user's computer; updated versions of the software are automatically downloaded as they become available.
Currently, since the InterPro database is updated on approximately a 3 month cycle, the downloading of new versions is performed manually. Similarly, each run of the proteins in a JIPS database against a new InterPro database is initiated using the administrator version of the JIPS client. This process takes only a few mouse clicks and allows for confirmation that the previous run, which may take several days depending on the number of proteins, has completed correctly.
The JIPS client is a Java Swing-based GUI that provides the user with an intuitive interface to browse InterProScan results, and allows managers to update the JIPS databases as required (the user must log in as an administrator to perform these functions; see below). The client contains five main components, arranged as follows (in the VBRC implementation): 1) the JIPS management console that lists all available local JIPS databases, and has options for creating/deleting databases or adding/editing biological sequence data from FASTA-formatted files or from the VBRC VOCs database to existing JIPS databases; 2) the JIPS Virus/Organism Browser window that displays all the organisms in a selected JIPS database and allows users to set viewing options; 3) the Summary of InterPro Hits window that displays the list of query genes from the selected organism, highlighting those genes which have new InterProScan hits; 4) the JIPS Hits Manager window, which displays detailed information about the hits for selected genes; and 5) the JIPS Orthologs Comparison window that allows users to compare the signature matches for protein orthologs.
Each database node in the database pane can be expanded by clicking on the adjacent arrow to show two child nodes. The jobs node is used by the administrator to initiate a new InterProScan search for each of the proteins in the selected JIPS database, and to check the status of running jobs. The query sequences node lists the proteins in the selected database. An administrator can group these sequences into gene families by selecting the genes and choosing "Add Family ID to Selected Sequences" from the Action menu; the Action menu changes according to items selected in the database pane.
Although we have focused on using the VOCs database as the source of protein sequences for the JIPS databases, the client (running with administrator privileges) can be used to load large numbers of sequences (in multi-sequence FASTA format) into JIPS. The process is menu driven and allows the administrator to select the name to be associated with the protein set.
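As an illustration of the kind of input handled when sequences are loaded from a multi-sequence FASTA file, the following is a minimal sketch of a reader that maps each header line to its sequence. The class and method names are hypothetical and are not taken from the JIPS source; the sketch assumes that headers are unique and performs no validation of the residues.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical multi-sequence FASTA reader; not taken from the JIPS source. */
final class FastaReaderSketch {

    /** Maps each header line (without the leading '>') to its concatenated sequence. */
    static Map<String, String> parse(String fastaText) {
        Map<String, String> sequences = new LinkedHashMap<>();
        String header = null;
        StringBuilder residues = new StringBuilder();
        for (String rawLine : fastaText.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines between records
            }
            if (line.startsWith(">")) {
                if (header != null) {
                    sequences.put(header, residues.toString()); // close the previous record
                }
                header = line.substring(1).trim();
                residues.setLength(0);
            } else if (header != null) {
                residues.append(line); // a sequence may wrap over several lines
            }
        }
        if (header != null) {
            sequences.put(header, residues.toString()); // close the final record
        }
        return sequences;
    }
}
```

A LinkedHashMap is used so that the sequences keep the order in which they appear in the file, which is convenient when the set is later listed for the administrator.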
The JIPS server accepts requests from the JIPS client over a specific communications port. Requests are classified into two categories: data and computation request messages. After receiving a data request message, the server retrieves the requested data from the JIPS database server and returns it to the client. Computation request messages are used to manage server-side jobs (InterProScan searches) that require significant time to execute. Software requirements for the JIPS server are Java 1.4 and MySQL 4.0, with locally installed InterProScan software containing all InterPro databases. Detailed information is available on installing the JIPS server locally [18]; this is required if users wish to enter their own protein sequences into the JIPS system.
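The distinction between the two request categories can be sketched as follows. This is an assumption-laden illustration only: the text does not describe the JIPS wire protocol, so the enum, the reply strings and the in-memory job list are all hypothetical stand-ins for the real message format, database query and job manager.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical request model; the actual JIPS protocol is not described in the text. */
final class RequestDispatchSketch {

    enum RequestType { DATA, COMPUTATION }

    /** Jobs accepted for background execution (stand-in for a real job queue). */
    private final List<String> queuedJobs = new ArrayList<>();

    /** Returns a short reply describing how the request was handled. */
    String handle(RequestType type, String payload) {
        switch (type) {
            case DATA:
                // A data request is answered immediately from the database.
                return "DATA:" + lookup(payload);
            case COMPUTATION:
                // A computation request only schedules a long-running search.
                queuedJobs.add(payload);
                return "QUEUED:" + queuedJobs.size();
            default:
                throw new IllegalArgumentException("Unknown request type: " + type);
        }
    }

    /** Placeholder for a query against the database server. */
    private String lookup(String key) {
        return "result-for-" + key;
    }

    /** Number of computation requests accepted so far. */
    int pendingJobs() {
        return queuedJobs.size();
    }
}
```

Separating the two categories in this way lets the server reply to browsing requests at once while searches that may run for days are tracked as jobs.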
JIPS Database Server
The JIPS database server is a MySQL relational database server. Within the VBRC implementation of JIPS, we have created separate databases for a series of taxonomic virus families (e.g. Poxviridae), each containing protein sequences from all fully-sequenced viruses belonging to the family. An alternative arrangement of sequence categories could just as easily be Principal Investigator, Graduate Student, Favorite Proteins, but neither categorization is necessary. JIPS is also capable of storing data about relationships between query proteins in its database. This ortholog information can either be obtained automatically (i.e. when viral proteomes are imported into JIPS from our VOCs database) or can be entered manually by an administrator (discussed below). JIPS is able to use these relationships to quickly sort the results (e.g. returning all available hits for poxvirus DNA ligases).
Types of information stored in each JIPS database within the JIPS database server. Signatures that do not have an InterPro id are also stored.
Data stored in table:
- Name, genome id, and GenBank accession number
- Name, gene id, protein sequence, and gene family information
- InterPro id, signature id, and date when hit was first recorded
- InterPro id, name, and type
- Signature database id, name, and type
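Because each stored hit carries the date on which it was first recorded, hits that are new since an earlier search can in principle be found with a simple date comparison. The following is a minimal sketch under that assumption; the class, its fields and the method are hypothetical and do not reproduce the JIPS schema or code.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hit record mirroring the stored hit fields; not the JIPS schema. */
final class HitSketch {
    final String interProId;       // may be null: some signatures have no InterPro id
    final String signatureId;
    final LocalDate firstRecorded; // date when the hit was first recorded

    HitSketch(String interProId, String signatureId, LocalDate firstRecorded) {
        this.interProId = interProId;
        this.signatureId = signatureId;
        this.firstRecorded = firstRecorded;
    }

    /** Returns the hits first recorded after the given date, e.g. the previous run. */
    static List<HitSketch> recordedAfter(List<HitSketch> hits, LocalDate previousRun) {
        List<HitSketch> fresh = new ArrayList<>();
        for (HitSketch hit : hits) {
            if (hit.firstRecorded.isAfter(previousRun)) {
                fresh.add(hit);
            }
        }
        return fresh;
    }
}
```

Storing the first-recorded date once, rather than a separate result set per run, keeps the comparison independent of how many times the search has been repeated.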
JIPS results and discussion
JIPS was conceived as a tool to help biologists manage and analyze the results generated by large numbers of InterProScan searches. It takes considerable hands-on time for a researcher to evaluate the results of even one InterProScan search when more than a few signatures are hit, and this problem is compounded when multiple proteins are searched. Reviewing results of repeated scans performed with different versions of InterPro is similarly tedious and time-consuming. Therefore, a primary goal in creating JIPS was to provide users with a tracking tool to quickly summarize differences between repeated searches. A second objective was to assist researchers in analyzing these results through comparative genomics.
JIPS is particularly useful for performing a series of InterProScan searches with a group of diverse protein orthologs. Investigating an InterPro signature that only appears in one, or a few, of the orthologs can be very productive. In some cases, comparison of the sequence containing the motif to the other orthologs may lead to the researcher detecting a variation of the signature pattern in these sequences. This would suggest that the original hit is significant and that the signature pattern may need to be generalized to reflect this new set of proteins. On the other hand, the researcher may conclude that the signature is indeed only present in the single sequence, suggesting that it is spurious (i.e. a random match). In both situations, running InterProScan on only a single member of the group would yield different, and possibly misleading, results.
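The comparison described above, finding signatures that appear in only one or a few members of an ortholog group, can be sketched as follows. The names are hypothetical and the code is an illustration of the idea rather than the JIPS implementation; it assumes the hits are already available as a set of signature ids per ortholog.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical ortholog comparison; not the JIPS implementation. */
final class OrthologComparisonSketch {

    /**
     * Given signature ids per ortholog, returns each signature that is missing
     * from at least one ortholog, mapped to the orthologs that do contain it.
     */
    static Map<String, Set<String>> partialSignatures(Map<String, Set<String>> hitsByOrtholog) {
        Map<String, Set<String>> carriers = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : hitsByOrtholog.entrySet()) {
            for (String signature : entry.getValue()) {
                carriers.computeIfAbsent(signature, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        // Drop signatures shared by every ortholog; only the partial ones remain.
        int orthologCount = hitsByOrtholog.size();
        carriers.values().removeIf(orthologs -> orthologs.size() == orthologCount);
        return carriers;
    }
}
```

The signatures that remain are the candidates for closer inspection, since they may indicate either a pattern that needs to be generalized or a spurious match.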
Browsing JIPS Hits
Although the JIPS database is itself a repository of InterPro search results, JIPS also provides a function (Save as BBB File button) that adds the signatures as comments (annotations) to a protein sequence and writes a file in BBB format. This allows simple reviewing and sharing of final results. The files are saved on the user's local computer and can be independently loaded into BBB. If required, the user can edit, add or delete these comments from within BBB; the other comment-associated features of that program are also available.
The signatures can also be written to a BBB file as part of the MSA generated by the Align Orthologs in BBB feature (Figure 7). In this case, to conserve space in the viewer, the comments are only written for the primary protein.
InterPro is an extremely valuable and complex resource that integrates a wide variety of protein signature databases. JIPS was designed to mine the information in a comparative fashion from multiple InterProScan searches, thereby relieving the biologist of a variety of tedious information management jobs. To this end, JIPS is a powerful but simple-to-use tool that helps bioinformaticians and biologists navigate and analyze the volumes of data they face following medium-scale (e.g. a single family of orthologs) and large-scale InterProScan searches. JIPS goes beyond data management by highlighting new signature matches for the user. It also integrates a series of tools to allow comparison of InterProScan searches for multiple proteins.
Through the viral databases maintained by Viral Bioinformatics – Canada [20], JIPS will support a large community of virologists; however, local installations will make it useful for a much wider audience.
Availability and requirements
The authors would like to acknowledge the many programmers who have contributed to VBRC projects over the years and other authors of Open Source software (http://www.opensource.org). The authors thank Cristalle Watson for critically reviewing the manuscript. This work was supported by an NIH/NIAID Contract HHSN266200400036C and by an NSERC Strategic grant to CU.
- Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res 2006, 34: D332-D334. doi:10.1093/nar/gkj145
- Genome Sequencing Projects [http://www.genomesonline.org]
- GenBank statistics [http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R: InterProScan: protein domains identifier. Nucleic Acids Res 2005, 33: W116-W120. doi:10.1093/nar/gki442
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Biswas M, Bradley P, Bork P, Bucher P, Copley R, Courcelle E, Durbin R, Falquet L, Fleischmann W, Gouzy J, Griffith-Jones S, Haft D, Hermjakob H, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lopez R, Letunic I, Orchard S, Pagni M, Peyruc D, Ponting CP, Servant F, Sigrist CJ, InterPro Consortium: InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform 2002, 3: 225–235. doi:10.1093/bib/3.3.225
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, Copley R, Courcelle E, Das U, Durbin R, Fleischmann W, Gough J, Haft D, Harte N, Hulo N, Kahn D, Kanapin A, Krestyaninova M, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McDowall J, Mitchell A, Nikolskaya AN, Orchard S, Pagni M, Ponting CP, Quevillon E, Selengut J, Sigrist CJ, Silventoinen V, Studholme DJ, Vaughan R, Wu CH: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33: D201-D205. doi:10.1093/nar/gki106
- Tutorial for InterPro [http://www.embl-ebi.ac.uk/interpro/tutorial.html]
- Whitney J, Esteban DJ, Upton C: Recent Hits Acquired by BLAST (ReHAB): a tool to identify new hits in sequence similarity searches. BMC Bioinformatics 2005, 6: 23. doi:10.1186/1471-2105-6-23
- Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 2000, 132: 185–219.
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res 2006, 34: D227-D230. doi:10.1093/nar/gkj063
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32: D138-D141. doi:10.1093/nar/gkh121
- Web-based InterProScan at EBI [http://www.ebi.ac.uk/InterProScan]
- The Viral Bioinformatics Resource Center [http://www.vbrc.org]
- Ehlers A, Osborne J, Slack S, Roper RL, Upton C: Poxvirus Orthologous Clusters (POCs). Bioinformatics 2002, 18: 1544–1545. doi:10.1093/bioinformatics/18.11.1544
- Brodie R, Smith AJ, Roper RL, Tcherepanov V, Upton C: Base-By-Base: single nucleotide-level analysis of whole viral genome alignments. BMC Bioinformatics 2004, 5: 96. doi:10.1186/1471-2105-5-96
- Client/Server Architecture [http://www.sei.cmu.edu/str/descriptions/clientserver_body.html]
- Installing JIPS [http://www.virology.ca/techDoc/softwaredevelopment/jips/install]
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. doi:10.1093/nar/gkh340
- Viral Bioinformatics – Canada [http://www.virology.ca]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.