Sanjeevini: a freely accessible web-server for target directed lead molecule discovery

Background: Computational methods that utilize structural and functional information help to understand specific molecular recognition events between a target biomolecule and candidate hits, and make it possible to design improved lead molecules for the target. Results: Sanjeevini represents a massive ongoing scientific endeavor to provide the user with a freely accessible, state-of-the-art software suite for protein- and DNA-targeted lead molecule discovery. It builds in several features, including automated detection of active sites, scanning against a million-compound library to identify hit molecules, all-atom docking and scoring, and various other utilities for designing molecules with the desired affinity and specificity against biomolecular targets. Each of the modules is thoroughly validated on a large dataset of protein/DNA drug targets. Conclusions: The article presents Sanjeevini, a freely accessible, user-friendly web server to aid in drug discovery. It is implemented on a teraflop cluster and made accessible via a web interface at http://www.scfbio-iitd.res.in/sanjeevini/sanjeevini.jsp. A brief description of the various modules, their scientific basis, validation, and how to use the server to develop in silico suggestions of lead molecules is provided.


Background
One of the main challenges in structure-based drug discovery is to utilize the structural and chemical information of drug targets and their ligand binding sites to create new molecules with high affinity and specificity, good bioavailability and the least possible toxicity [1]. Computer-aided drug discovery is proving invaluable in this context. The rapid ascent and acceptance of this methodology has been made feasible by advances in software and hardware. The Sanjeevini server has been developed as an enabler for drug designers to address issues of affinity and selectivity of candidate molecules against drug targets with known structures. Sanjeevini comprises several modules with different functions: automated identification of potential binding sites (active sites) of ligands on the biomolecular target [90]; rapid screening of a million-molecule database/natural product library [91] to identify good candidates for any target protein; optimization of their geometries [92] and determination of partial atomic charges using quantum chemical methods [92,93]; assignment of force field parameters to the ligand [94] and the target protein/DNA [95]; docking of the candidates in the active site of the drug target via Monte Carlo methods [90,96]; estimation of binding free energies through empirical scoring functions [97][98][99]; and rigorous analyses of the structure and energetics [100,101] of binding for further lead optimization. If desired, the computational pathway rolls over into an automated pipeline for lead design. The software takes the three-dimensional structure of the target protein or the nucleotide sequence of the DNA as input; the remaining functionalities are built into the software suite to arrive at the structure and desired binding free energy of the protein/DNA-candidate molecule complex. The methodology treats the biomolecular target and candidate molecules at the atomic level and the solvent as a dielectric continuum.
Validation studies on a large number of protein-ligand and DNA-ligand complexes suggest that the performance of Sanjeevini is at the state of the art. The software is freely accessible over the internet. We describe here how to harness the server for accelerating lead molecule discovery.
The front end of the Sanjeevini website is shown in Figure 1 and the overall architecture of the software suite is given in Figure 2. Sanjeevini is a user-friendly web interface in which the demands on the user have been reduced to uploading the target protein coordinate file or DNA sequence and the ligand molecule. The software protocol automatically standardizes the input format of the biomolecule. Additionally, by analyzing the target protein file, it determines which branch of the pathway (Figure 2) has to be followed (protein with known binding site/protein with unknown binding site) and redirects the job instance accordingly. Thus, the user is spared any overhead of pre-formatting the input files for docking and scoring. The user can supply the desired ligand molecule either by drawing it or by selecting one from the molecular databases incorporated into Sanjeevini. Three molecular databases are built into Sanjeevini: NRDBSM, containing 17000 molecules [82]; a million-molecule database containing one million small molecules; and a natural product database with 0.1 million natural products and their derivatives [91]. The molecules present in the databases are Lipinski compliant [102,103].
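The Lipinski compliance mentioned above can be illustrated with a minimal Python sketch. It assumes the four descriptors have already been computed by some other tool; the `Descriptors` container, the function names and the default tolerance of one violation are illustrative assumptions and are not taken from the Sanjeevini code. The thresholds are the standard rule-of-five values (molecular weight at most 500 Da, LogP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors).

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors of one molecule (hypothetical container)."""
    mol_weight: float   # molecular weight in Da
    logp: float         # octanol-water partition coefficient (LogP)
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        d.mol_weight > 500.0,
        d.logp > 5.0,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def is_lipinski_compliant(d: Descriptors, max_violations: int = 1) -> bool:
    """Return True if the molecule has at most `max_violations` violations.

    Allowing one violation is a common convention; whether Sanjeevini
    uses this tolerance or a stricter one is not stated in the article.
    """
    return lipinski_violations(d) <= max_violations
```

Setting `max_violations=0` gives the strictest reading of the rule, which may be preferable when pre-filtering a large library.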
The Sanjeevini database of small organic molecules and the natural product database reside on the Linux clusters. Based on the user's choice of physicochemical properties of interest, including molecular weight, LogP, number of hydrogen bond donor and acceptor atoms, overall formal charge of the molecule and many more, a list of all the molecules falling in the ranges provided by the user is displayed in a downloadable form. If, however, the user uploads a self-drawn molecule, its bioavailability can be checked by clicking the Lipinski's rule option in Sanjeevini; the program predicts the physicochemical properties (Lipinski's rules) of the uploaded ligand molecule. If the binding site of the uploaded target protein is known and the coordinates of the protein-ligand complex are available in RCSB [104], then one can quickly check the binding affinity of the uploaded ligand and can also scan databases of small organic molecules [91] against any target protein by clicking the RASPD option (Mukherjee and Jayaram, manuscript in preparation). The RASPD module takes 10-15 minutes to screen the database against a target protein. The docking and scoring module of Sanjeevini performs a series of computational steps in an automated mode: it prepares the protein and the ligand from the uploaded files, docks the candidate molecule at the binding site via a Monte Carlo algorithm, and minimizes and scores the docked complex. The average time taken by the protein and ligand preparation and the Monte Carlo docking program ranges from 1 to 3 minutes. The Monte Carlo docking program is implemented in a parallel processing mode. The docked complexes are further minimized using the parallel version of the Sander module of AMBER [105], which scales best on 32 processors. Sanjeevini programs run on Linux clusters with InfiniBand network resources, which facilitate high-throughput distribution of the data across the various nodes.
On average, the total time taken by the complete docking and scoring protocol ranges from 5 to 20 minutes depending on the size of the protein and the ligand. The time frames reported above correspond to performance on a 32-processor cluster. A benchmark test on 8, 16 and 32 processors showed that the entire docking and scoring module scaled best on 32 processors. Memory consumption and I/O issues are minimal during program execution. The time taken also depends on the load on the server. Currently 80 processors are dedicated to jobs submitted to Sanjeevini. For each molecule, five docked structures representing the poses of the molecule in the active site, along with the binding affinity, are emailed to the user. However, if the binding sites in the protein are unknown, the AADS [90] option predicts ten hot spots/binding sites in the protein and docks the uploaded ligand molecule at all ten predicted sites. Five docked structures representing the poses of the ligand molecule in the binding site, along with their binding free energies, are reported back to the user. These docked structures may be treated as a reference protein-ligand complex, which can be given as input to scan the publicly accessible version of the commercially available compound database (http://zinc.docking.org/) through the RASPD protocol to arrive at suggestions of additional hit molecules against a target protein with unknown binding site information. A new cycle of design, docking and scoring can then be initiated for iterative improvement of the candidate molecule towards the desired affinities and scaffolds.
Target-molecule complexes with high binding affinity can, in propitious cases, be subjected to molecular dynamics simulations [101] to investigate the effect of conformational flexibility, solvent, salt and entropic factors. About 100 or more structures may be collected over the trajectories and converged average binding free energies of the complexes may be obtained. Further post facto energy component analyses of the target-ligand complex can guide chemical modifications of the candidate molecule to enhance binding affinity. The modules described above have been incorporated into a pipeline, as depicted in the architecture (Figure 2).
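As an illustration of averaging binding free energies over trajectory snapshots, the following Python sketch computes the mean and a naive standard error from a list of per-snapshot energies. The function name and the input format are assumptions made for this example; the article does not specify how Sanjeevini performs the averaging or judges convergence. The standard error shown treats snapshots as statistically independent, which is optimistic for closely spaced molecular dynamics frames.

```python
import math
from statistics import mean, stdev


def average_binding_energy(energies):
    """Return (mean, standard error) of per-snapshot binding free energies.

    `energies` is a sequence of values in kcal/mol, one per snapshot.
    The standard error assumes uncorrelated snapshots, so it is a lower
    bound on the true uncertainty for correlated trajectory frames.
    """
    n = len(energies)
    if n < 2:
        raise ValueError("need at least two snapshots")
    avg = mean(energies)
    sem = stdev(energies) / math.sqrt(n)
    return avg, sem
```

A simple convergence check, under the same assumptions, is to call the function on the first half and on the full set of snapshots and confirm that the two means agree within the reported error.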

A brief description of a few frequently used modules in Sanjeevini
The Sanjeevini software comprises several high-accuracy modules working in a pipeline. Given a protein/DNA as the drug target, and optionally a ligand molecule, it helps in designing lead molecules.

Scoring function
Sanjeevini comprises three scoring functions, christened Bappl [97], Bappl-Z [98] and PreDDICTA [99], for protein-ligand complexes, Zn-containing metalloproteinase-ligand complexes and DNA-ligand complexes, respectively. Bappl is an all-atom energy-based empirical scoring function comprising electrostatics, van der Waals, desolvation and loss of conformational entropy of protein side chains upon ligand binding. Bappl-Z scores protein-ligand complexes with Zn as the metal ion in the binding site; it employs a non-bonded approach to model the interactions of the zinc ion with all other atoms of the protein-ligand complex, along with the four terms described for Bappl. PreDDICTA is an all-atom energy-based scoring function which computes the binding affinity of a DNA oligomer with a non-covalently bound drug molecule in the minor groove. The function is a combination of electrostatics, steric complementarities, entropic and solvent effects, including hydrophobicity. There are very few high-accuracy scoring functions reported in the literature for DNA-ligand complexes, and PreDDICTA thus provides a strong platform for designing molecules that bind specifically to DNA. The program takes a DNA-ligand complex as input and outputs the binding free energy associated with the complex.

Docking module
The docking module of Sanjeevini comprises three programs, christened ParDOCK [96], AADS [90] and DNADock [96,99]. ParDOCK is an all-atom, energy-based Monte Carlo protein-ligand docking algorithm. The module requires a reference protein-ligand complex (the target protein bound to a reference ligand at its binding site) as input, along with the candidate molecule to be docked. The algorithm docks the ligand molecule to the reference protein and outputs five docked structures representing different poses of the ligand molecule, along with the binding free energies of the docked poses predicted using the Bappl/Bappl-Z scoring function. The program is built into the Sanjeevini software for docking ligand molecules to target proteins for which a crystal structure of the protein-ligand complex is available in the literature. AADS (An automated active site identification, docking and scoring protocol for protein targets based on physico-chemical descriptors) predicts all potential binding sites in a protein and docks the input ligand molecule at the top ten predicted binding sites. Eight docked structures are generated at each of these ten sites and scored using the Bappl/Bappl-Z scoring function. The five energetically most favorable of the eighty structures are emailed back to the user along with their binding free energy values. The program has been tested previously [90] on more than 600 protein-ligand complexes with known binding site information; AADS predicted the true binding site within the top ten sites with 100% accuracy. A blind docking on 170 protein targets [90] with known binding sites and known experimental binding free energies associated with the complexed ligands was also performed.
The methodology restored the binding pose of the ligands to their native binding sites in the above 170 complexes with an accuracy of 90% for the top-ranked docked structure, and the predicted binding free energies of the top-ranked docked structures correlated well with experiment (correlation coefficient ~0.82; see Figure F4 of [90]). The RMSD (root mean square deviation) between the crystal and docked structures is within 2 Å in more than 80% of the cases (Figure F5 of [90]). DNADock is an all-atom Monte Carlo based docking algorithm which has been implemented in parallel mode and incorporated into the software suite. The program takes the nucleotide sequence and the candidate ligand molecule as input, generates canonical A or B DNA [123] or an average molecular dynamics B DNA structure [124,125] based on the user's choice, docks the candidate ligand molecule in the minor groove of the DNA, and scores the docked structures with the PreDDICTA scoring function. Five docked structures with their binding free energy values are reported back to the user.
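The AADS selection step described above (eight docked structures at each of ten predicted sites, of which the five energetically most favorable are reported) can be sketched in Python as follows. The `Pose` record and the `top_poses` function are hypothetical names introduced for illustration; the sketch assumes only that each pose carries a predicted binding free energy in which a more negative value is more favorable.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Pose(NamedTuple):
    """One docked structure (hypothetical record for illustration)."""
    site: int        # index of the predicted binding site, 0-9
    pose_index: int  # index of the pose within that site, 0-7
    energy: float    # predicted binding free energy in kcal/mol


def top_poses(poses: Iterable[Pose], n: int = 5) -> List[Pose]:
    """Return the n poses with the lowest (most favorable) energy.

    The result is sorted from most to least favorable.
    """
    return heapq.nsmallest(n, poses, key=lambda p: p.energy)
```

Because the ranking pools all eighty poses, the five reported structures need not come from five different sites; several may belong to the same predicted site.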
RASPD (A rapid identification of hit molecules for target proteins via physico-chemical descriptors) is a computationally fast protocol for identifying hit molecules for any target protein. The methodology establishes complementarity in the physico-chemical descriptor space of the target protein and the candidate molecule via a QSAR-type approach and rapidly generates a reasonable estimate of the binding energy. The accuracies of RASPD are discussed elsewhere (Mukherjee and Jayaram, manuscript in preparation).

Results and discussion
The scoring functions of the Sanjeevini software were validated on a large dataset of 366 protein-ligand, Zn-containing metalloproteinase-ligand and DNA-ligand complexes, comprising 335 crystal structures and 31 modeled structures. The PDB IDs of the validation dataset, with the experimental and predicted binding free energies, are provided in Additional file 1. A correlation coefficient of r = 0.88 was obtained between the experimental and predicted binding free energies on this dataset, as shown in Figure 3. Published results of scoring functions for protein-ligand complexes originating in physics-based or knowledge-based methods include DFIRE (r = 0.63) [106], X-Score (r = 0.77) [107], SMoG (r = 0.79) [108], BLEEP (r = 0.74) [109], PMF (r = 0.78) [110], SCORE (r = 0.81) [111], LUDI (r = 0.83) [112], ChemScore (r = 0.84) [113], Ligscore (r = 0.87) [114] and KGS, comprising both X-Score and PLP (r = 0.82) [115]. The Sanjeevini scoring function for protein-ligand complexes yielded a correlation coefficient (r) of 0.87. There are very few scoring functions reported in the literature for DNA-ligand complexes; one among them is the KS score (r = 0.68) [116]. The Sanjeevini scoring function for DNA-ligand complexes has been tested, without any training, on 39 DNA-ligand complexes and yielded a correlation coefficient of 0.90. PreDDICTA has been reported to perform better than some of the existing scoring functions for DNA-ligand complexes in the literature [116]. The docking module of Sanjeevini has been validated on a dataset of 335 DNA/protein targets with known binders, known structures and known experimental binding free energies. The predicted binding free energies of the top-ranked docked structures reported by Sanjeevini (Additional file 2) were compared with experiment (Figure 4), and the RMSDs (root mean square deviations) between the crystal structures and the top-ranked docked structures were also computed (Figure 5).
The high accuracies obtained by Sanjeevini, as evident from a correlation coefficient of r = 0.83 in Figure 4 and RMSDs lying within 2 Å in Figure 5, provide a strong platform to design drug-like molecules.
For protein-ligand complexes, AutoDock Vina [5] has been reported to predict the top-ranked structure within 2 Å RMSD of the native complex with 80% accuracy. In a recent work by Zhong-Ru Xie et al., the DrugScore CSD scoring function was compared with some of the known scoring functions in the literature [122] and was reported to perform better than the others, with an accuracy of 87% in predicting the top-ranked docked structure within an RMSD of 2 Å from the crystal structure. The docking and scoring module of Sanjeevini yielded 90% accuracy in predicting the top-ranked docked structure within 2 Å RMSD from the crystal structure on a large dataset (335 complexes; Figure 5).
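The accuracy measure used in these comparisons, the fraction of top-ranked docked structures lying within 2 Å RMSD of the crystal pose, can be sketched in Python as follows. This is an illustrative sketch, not the procedure used by the authors: the function names are hypothetical, and it assumes that the docked and crystal ligand coordinates are in the same receptor frame, list the same atoms in the same order, and need no superposition or symmetry correction.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root mean square deviation between two matched sets of atoms (in Å)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def success_rate(rmsds: Sequence[float], cutoff: float = 2.0) -> float:
    """Fraction of complexes whose RMSD is at or below the cutoff."""
    if not rmsds:
        raise ValueError("no RMSD values given")
    return sum(r <= cutoff for r in rmsds) / len(rmsds)
```

No superposition is applied because, in a docking comparison, moving the docked ligand onto the crystal ligand would hide exactly the placement error that the RMSD is meant to measure.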

Case studies
While designing new molecules for a target protein/DNA, the user may have experimental (Ki/IC50/Kd) values of known binders reported in the literature. Before designing new candidate molecules against a target protein/DNA, we suggest that the Sanjeevini user predict the binding free energies of the known binders and plot a correlation graph between the experimental and predicted binding free energies. This gives a relative understanding of the predicted binding free energies vis-a-vis experiment, which helps in discriminating between drug-like and non-drug-like molecules for a given target. With this proposal, we present a few case studies on an important class of drug targets, which can serve as examples for Sanjeevini users applying the same methodology to other drug targets to come up with suggestions of hit molecules.
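The calibration step proposed above can be sketched in Python under two stated assumptions. First, an experimental inhibition or dissociation constant in molar units is converted to a binding free energy with the standard relation ΔG = RT ln K at 298.15 K; applying the same relation to an IC50 value is only an approximation, since IC50 depends on assay conditions. Second, agreement with the predicted values is summarized by the Pearson correlation coefficient. The function names are hypothetical and are not part of Sanjeevini.

```python
import math
from typing import Sequence

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def binding_free_energy(k_molar: float, temperature: float = 298.15) -> float:
    """Convert Ki or Kd (in mol/L) to a binding free energy in kcal/mol."""
    if k_molar <= 0:
        raise ValueError("constant must be positive")
    return R_KCAL * temperature * math.log(k_molar)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

In use, the experimental constants of the known binders would be converted with `binding_free_energy` and passed to `pearson_r` together with the binding free energies returned by the server for the same molecules.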

Case 1: Protein targets with known binding site information
The majority of drugs deposited in RCSB have been co-crystallized with one or more proteins [126], yielding the drug binding site for the target protein. The first case study was on protein targets for which structures of the protein-ligand complexes, specifying the binding site, were available in the database. Serine proteinases play an important role in many biological processes [127]. For instance, trypsin helps in digestion and thrombin acts in the blood coagulation cascade. This class of enzymes is implicated in a wide spectrum of diseases related to malfunctioning in the regulation of these processes. We predicted the binding energies of 12 trypsin-binding molecules. In addition, some of the known synthetic inhibitors [128] of bovine pancreatic trypsin (PDB ID 1S0R) were also docked and scored. The predicted binding free energies associated with the top-ranked docked complex for all the above data are shown in Table 1. A correlation coefficient of r = 0.92 was obtained between the experimental and predicted binding free energies, as illustrated in Figure 6.

Case 2: Input as a target protein with unknown binding site and a candidate ligand
When the user has the 3D coordinates of a target protein, either as deposited in the Protein Data Bank or as a modeled structure, with no binding site information, the AADS pathway of Sanjeevini is pre-selected to come up with suggestions of hit molecules. We performed a case study on the trypsin-binding inhibitors considered in the first case study. For the twelve protein structures complexed with a ligand and having known binding site information, we deliberately removed the ligands from the target proteins and uploaded each target to Sanjeevini for a blind docking with its ligand. For bovine pancreatic trypsin, a structure with unknown binding site information (PDB ID 1S0Q) is also available in the literature [128,129], along with the protein-ligand complex (PDB ID 1S0R) that was taken as input in the first case study.
The target receptor with unknown binding site and its synthetic inhibitors were given as input to Sanjeevini. The AADS module gave an output of five docked structures along with their binding free energies. In this case study, a total of 230 docking runs (10 binding sites per target for the 23 trypsin-binding molecules) were performed in an automated mode by Sanjeevini. We took the predicted binding free energy of the energetically top-ranked structure for each target (Table 1) and plotted a correlation graph between the experimental and predicted binding free energies (Figure 7).
In bovine pancreatic trypsin, the amino acids mainly involved in interactions with the ligand molecules are reported to be Ser 172, Asp 171 and Gly 196 in the target protein (PDB ID 1S0R) [104]. We visualized the docked structures obtained from the above blind docking studies of trypsin inhibitors against the target (PDB ID 1S0Q) to verify whether the top-ranked docked structures have the native ligand pose restored in the native binding site of the target. The good estimates of the binding free energies obtained through the Sanjeevini protocol in the above two case studies, evident from the high correlation coefficients (Figures 6 and 7) obtained by two different methodologies that handle inputs with known and unknown binding site information in a protein target, illustrate the strength of the Sanjeevini software.

Future directions of Sanjeevini
Improvements envisaged for future versions of Sanjeevini are: (i) consideration of the flexibility of the candidate ligand molecules and of the active site amino acids of the target, (ii) docking and scoring of the candidate molecules in the presence of a cofactor or multiple metal ions, (iii) extension of the DNA docking and scoring methodology to DNA-binding intercalators, and eventually (iv) creation of an assembly line from genomes to hits [130].

Conclusions
This article presents Sanjeevini, a state-of-the-art, structure-based computer-aided drug discovery (SBDD/CADD) software suite implemented on an 80-processor cluster and presented to the user as a freely accessible server. The high accuracy of the modules and a user-friendly environment should help the user in designing novel lead compounds.

Availability and requirements
Project name: Sanjeevini
Project home page: http://www.scfbio-iitd.res.in/sanjeevini/sanjeevini.jsp
Operating systems: Linux
Programming languages: C++ and Java
Any restrictions to use by non-academics: none
A detailed tutorial with various inputs and outputs of Sanjeevini in the form of snapshots is available at http://www.scfbio-iitd.res.in/sanjeevini/example/Tutorial.pdf. The coordinates of the validation dataset of 335 protein/DNA targets are available at http://www.scfbio-iitd.res.in/sanjeevini/dataset.jsp.