Server-side architecture
An overview of the PSAT architecture is shown in Fig. 1. As of this writing, the PSAT back end is implemented in Python 2.6 on a Red Hat Linux cluster comprising eight dual Intel Xeon X5690 6-core, 3.46 GHz processors and a 1 TB disk drive. Data pertaining to job management are handled using Python's SQLite interface, and persistent storage is managed using MySQL. Once the web server accepts a PSAT job requested by a user, it communicates with the distributed computing layer to perform all computing tasks relevant to that job. The computing tasks executed through the distributed computing APIs may involve any or all of the software packages currently supported by PSAT: InterProScan [13], SignalP [14], EFICAz [15], and BLAST+ [16]. In our architecture, a software package may query any of the databases installed on the PSAT server, including KEGG [17], MetaCyc [18], STRING [19], and MVirDB [20]. Currently, only BLAST+ queries these databases, as it is the only annotation tool installed on PSAT that is designed to search a local, general-purpose biological sequence database. The KEGG data set was acquired from the Kanehisa laboratory through an organizational license. Current versions of these tools and data sources can be found on the PSAT home page (http://psat.llnl.gov/psat/).
Django web application
PSAT is powered by Django, an open-source web application framework [21]. Django adopts the standard "model, view, controller" (MVC) architectural pattern, whereby the model defines the data, the view controls how data are presented to the user, and the controller relies on the model and the view to perform the necessary operations on the data when interpreting a user request [22]. Using such a framework removes the dependencies between the model and the view, which in turn enhances source code reusability and software stability [23]. In our Django web application, support for a new database or software package can be added without modifying a significant portion of the source code.
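As a minimal sketch of this separation of concerns (class and field names here are illustrative assumptions, not PSAT's actual schema), adding support for a new annotation tool largely amounts to adding new values to a result model rather than rewriting the views:

```python
# models.py -- hypothetical sketch; names and fields are illustrative,
# not taken from the PSAT source code.
from django.db import models

class AnnotationJob(models.Model):
    """A submitted PSAT job and its overall status."""
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    fasta_file = models.FileField(upload_to='jobs/')
    status = models.CharField(max_length=16, default='queued')
    submitted = models.DateTimeField(auto_now_add=True)

class ToolResult(models.Model):
    """One tool's output for a job; a new package only introduces a new 'tool' value."""
    job = models.ForeignKey(AnnotationJob, on_delete=models.CASCADE)
    tool = models.CharField(max_length=32)   # e.g. 'eficaz', 'kegg_blast', 'signalp'
    output_path = models.CharField(max_length=256)
```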
Distributed computing
All tasks related to parallel computing in PSAT are handled by Celery [24] and RabbitMQ [25]. RabbitMQ is open-source message broker software that offers robust, highly scalable asynchronous processing [25], while Celery distributes jobs to its workers as it consumes the messages delivered by RabbitMQ [24]. Fig. 2 shows the real-time messaging among the Django web application, RabbitMQ, the Celery workers, and the local MySQL database. When the back-end Django web application receives a job request, Celery wraps the job execution function as a task through its decorator and pushes the task message to the RabbitMQ server. RabbitMQ acts as an exchange, distributing the jobs to 64 celeryd workers across 8 compute nodes based on processor availability. When a celeryd worker receives a job request, it processes the job directly on its node. All job processing is handled asynchronously in the background without delaying (hanging) the front-end web pages. On the server side, data provenance is captured to ensure that sequence analysis results are generated in a reproducible and systematic manner.
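A minimal sketch of this pattern (using the current Celery API; the module, broker URL, and function names are assumptions rather than PSAT's actual code) looks as follows; calling run_annotation.delay(...) from a Django view publishes a message to RabbitMQ and returns immediately:

```python
# tasks.py -- illustrative only; PSAT's real task definitions are not shown here.
import subprocess
from celery import Celery

# RabbitMQ is the message broker; workers on the compute nodes consume from it.
app = Celery('psat', broker='amqp://guest@localhost//')

@app.task
def run_annotation(tool_cmd, fasta_path):
    """Run one annotation tool on one FASTA file on whichever worker picks this up."""
    return subprocess.check_output(tool_cmd + [fasta_path])
```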
Each core of the Linux cluster can serve as a Celery worker performing a specific computational task. Jobs executed on the 64 cores run in parallel and are split into multiple subtasks handled by multiple worker threads. When the background sequence analysis by all Celery worker threads is complete, PSAT automatically combines the results for user download through the PSAT website or via a link provided in a notification email.
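One way to express this fan-out/fan-in behaviour with Celery's own primitives is a chord, whose callback fires only after every subtask has completed; the function names below (run_annotation, combine_results) are hypothetical:

```python
# Sketch of splitting a job into parallel subtasks and merging their results.
from celery import chord
from tasks import run_annotation, combine_results  # combine_results is assumed to exist

def submit_job(tool_cmd, fasta_chunks, job_id):
    """Fan each FASTA chunk out to a worker, then merge outputs when all have finished."""
    subtasks = [run_annotation.s(tool_cmd, chunk) for chunk in fasta_chunks]
    # The chord callback runs once, after every subtask in the group has completed.
    return chord(subtasks)(combine_results.s(job_id))
```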
PSAT package support
PSAT provides a centralized computational resource for a variety of protein sequence annotation tools. PSAT supports a suite of software packages designed to predict enzyme functions for a given set of protein sequences, most notably EFICAz 2.5, which uses machine learning algorithms to automatically infer the enzyme function of a protein [15]. MetaCyc BLAST and KEGG BLAST are also available to derive similar information by running BLAST+ against the open-source MetaCyc and licensed KEGG databases [17, 18], respectively. Combining the results of the EFICAz, MetaCyc BLAST, and KEGG BLAST analyses produces lists of Enzyme Commission (EC) numbers that putatively describe the functions of the query proteins. In the summary output file, all predicted EC numbers for each protein are listed numerically, followed by the evidence (i.e., EFICAz, KEGG BLAST, or MetaCyc BLAST) for each EC. No attempt is made to rank-order the evidence items or to combine them into a single prediction; instead, the PSAT output enables comparison of annotation results across the different annotation methods. The predicted EC numbers are then linked to metabolic pathways by means of a RESTful interface to the KEGG API [26], which retrieves up-to-date enzyme-to-pathway mappings.
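For illustration, an enzyme-to-pathway lookup of this kind can be made against the KEGG REST interface roughly as follows (Python 3 shown for brevity; the parsing of the tab-separated response reflects an assumption about its layout, and PSAT's own client code may differ):

```python
# Hedged sketch of querying the KEGG REST API for pathways linked to an EC number.
import urllib.request

def pathways_for_ec(ec_number):
    """Return KEGG pathway identifiers linked to an EC number, e.g. '1.1.1.1'."""
    url = 'https://rest.kegg.jp/link/pathway/ec:{}'.format(ec_number)
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().strip().splitlines()
    # Each non-empty line is expected to look like: "ec:1.1.1.1<TAB>path:map00010"
    return [line.split('\t')[1] for line in lines if '\t' in line]

print(pathways_for_ec('1.1.1.1'))
```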
To supplement PSAT's primary goal of whole-proteome enzyme function prediction, we have also added the functional annotation tools SignalP 4.0 [14] and InterProScan 5 [13] to the meta-server. Furthermore, the STRING [19] and MVirDB [20] databases are now available for BLAST+ searches on PSAT.
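A BLAST+ search against one of these locally installed databases can be wrapped in a few lines; the database path and parameter values below are placeholders rather than PSAT's actual configuration:

```python
# Illustrative wrapper around the NCBI blastp command-line tool.
import subprocess

def blastp_search(query_fasta, db_path, out_path, evalue=1e-5, threads=4):
    """Run blastp against a pre-formatted local protein database (e.g. STRING or MVirDB)."""
    cmd = [
        'blastp',
        '-query', query_fasta,
        '-db', db_path,            # e.g. '/data/blastdb/mvirdb' (placeholder path)
        '-evalue', str(evalue),
        '-num_threads', str(threads),
        '-outfmt', '6',            # tabular output
        '-out', out_path,
    ]
    subprocess.check_call(cmd)
```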
User interfaces and access
PSAT was built using a thin-client approach in which the entire MVC logic resides on the server side; hence, only a web browser is required to run a sequence analysis on PSAT. Online registration for a user account is available to all PSAT users at http://psat.llnl.gov/psat. User authentication is required to submit annotation jobs to the PSAT server. When submitting a new job, a user either pastes the FASTA sequences into the job submission form or uploads a file containing a set of amino acid sequences in FASTA format. When the job finishes, an automated email containing a download link for the job results is sent to the user.
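The completion notice can be sent with Django's standard mail facility; the sketch below is hypothetical, and the sender address and URL scheme are placeholders rather than PSAT's actual values:

```python
# Illustrative completion email; django.core.mail.send_mail is Django's standard API.
from django.core.mail import send_mail

def notify_user(user_email, job_id):
    """Email the user a download link once all analyses for the job have finished."""
    link = 'http://psat.llnl.gov/psat/jobs/{}/download'.format(job_id)  # assumed URL scheme
    send_mail(
        subject='Your PSAT job {} is complete'.format(job_id),
        message='Results are available at {}'.format(link),
        from_email='psat-noreply@example.org',  # placeholder sender address
        recipient_list=[user_email],
    )
```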
Summary results are presented as a tab-delimited text file containing the computational predictions and reliability metrics from the set of tools that were run for a given job. Because the user's input FASTA sequences are processed by PSAT in parallel, individual computations finish out of order with respect to the original input FASTA file. Therefore, prior to processing, the user's headers are prepended with a sequential numeric identifier so that the original ordering can be re-established when the job completes. The (voluminous) raw results are stored in persistent storage on our server and are available upon request for up to 3 months following completion of a job.
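The header-numbering step can be expressed as a simple pass over the FASTA file; the identifier format below is an assumption, since the exact format PSAT uses is not specified here:

```python
# Minimal sketch of prepending a sequential numeric identifier to each FASTA header.
def number_fasta_headers(in_path, out_path):
    """Write a copy of the FASTA file with headers like '>000001_original header'."""
    index = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            if line.startswith('>'):
                index += 1
                dst.write('>{:06d}_{}'.format(index, line[1:].lstrip()))
            else:
                dst.write(line)
```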
All PSAT users must log in with their credentials at the beginning of each PSAT session. Once a user has successfully logged in, a homepage is displayed dynamically, listing that user's recently submitted PSAT jobs and their corresponding statuses. To control server load and file transfer volumes, we limit the number of protein sequences per submission to 800. However, users are encouraged to contact the authors regarding submission of larger jobs, and we welcome jobs that require enzyme prediction over whole bacterial proteomes or nonspecific protein sets of up to 10,000 protein sequences per job.
Genomic sequence for the case study
The genome of Herbaspirillum sp. strain RV1423 (henceforth RV1423), which was isolated from underground water contaminated with alkane and aromatic hydrocarbons, has been sequenced in a whole-genome shotgun project [27]. The draft genome of RV1423 obtained from NCBI [28] comprises 131 contigs under the accession numbers CBXX010000001 to CBXX010000131. This newly sequenced strain, which has previously been reported as a potential naphthalene degrader [27], was selected for our case study to demonstrate the ability of PSAT to derive functional annotations and link them to metabolic pathways that may be present in a draft genome that has not yet been fully annotated.
Pre-processing of genomic sequence
A previous study identified a set of 5732 potential protein-coding genes in RV1423 using the RAST server version 4.0 [27]. A newer, renumbered release of the RAST server (version 2.0) [29, 30] was used in our study and generated a set of 5649 features that are potentially protein-coding genes. These predicted genomic features were translated into amino acid sequences, which served as the input for PSAT. EC data arising from the PSAT processing were subsequently reformatted as input for EC2KEGG [31].
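As a rough sketch of that reformatting step (the tab-delimited layout and 'EC' column name of the PSAT summary file, and EC2KEGG's one-EC-per-line input, are assumptions here):

```python
# Hedged sketch: reduce a PSAT summary table to a plain list of EC numbers for EC2KEGG.
import csv

def summary_to_ec_list(summary_path, out_path):
    """Collect unique EC numbers from the summary file and write one per line."""
    ecs = set()
    with open(summary_path) as fh:
        for row in csv.DictReader(fh, delimiter='\t'):
            ec = row.get('EC', '').strip()
            if ec:
                ecs.add(ec)
    with open(out_path, 'w') as out:
        for ec in sorted(ecs):
            out.write(ec + '\n')
```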
EC2KEGG analysis and statistical significance
The pathways inferred from PSAT results may be over- or under-represented when compared with a reference genome. To evaluate the statistical significance of the inferred metabolic pathways, we used EC2KEGG (available at http://sourceforge.net/projects/ec2kegg) to compute the false discovery rate (FDR) for each pathway [31]. Any pathway with an FDR-adjusted p-value below 0.05 is considered statistically significant. Currently, there is only one reference genome for the genus Herbaspirillum in KEGG, H. seropedicae; hence, the genome of H. seropedicae was chosen as the reference genome for the statistical evaluation.
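EC2KEGG performs the adjustment itself; the snippet below is only an illustration of how a Benjamini-Hochberg FDR correction operates on a list of raw pathway p-values, not EC2KEGG's own code:

```python
# Illustrative Benjamini-Hochberg FDR adjustment.
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvalues[i] * m / rank)
        adjusted[i] = prev
    return adjusted

# Pathways with adjusted p-values below 0.05 would be called significant.
significant = [p < 0.05 for p in benjamini_hochberg([0.001, 0.02, 0.04, 0.3])]
```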