Large-scale molecular modeling
Homology modeling is usually the method of choice when there is a clear relationship of homology between the sequences of a target protein and at least one experimentally determined three-dimensional structure. This computational technique is based on the assumption that tertiary structures of two proteins will be similar if their sequences are related, and it is the approach most likely to give accurate results [16].
The number of protein sequences that can be modeled and the accuracy of the predictions are increasing steadily due to the growth in the number of experimentally determined protein structures and because of the improvements in the modeling software. It is currently possible to model with useful accuracy significant parts of approximately one half of all known protein sequences [17].
The molecular modeling in this work was performed with MODELLER version 9v4 [10, 18], a program for comparative protein structure modeling (http://salilab.org/modeller). The program extracts atom-atom distance and dihedral angle restraints on the target from the template structure and combines them with general rules of protein structure, such as bond length and angle preferences. The model is then calculated by an optimisation procedure that minimises violations of the spatial restraints [11]. In the simplest case, the input is an alignment of the sequence to be modeled with the template structures, the atomic coordinates of the templates and a short script file. MODELLER then automatically calculates a model containing all non-hydrogen atoms, without any user intervention and within minutes on a processor [19].
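The idea of scoring a model by its violations of spatial restraints can be illustrated with a toy function. This is only a sketch under simplifying assumptions: it handles distance restraints alone, uses a plain sum of squared, scaled deviations, and is not MODELLER's actual objective function. The name `restraint_violation` and the tuple layout of a restraint are hypothetical.

```python
import math


def restraint_violation(coords, restraints):
    """Sum of squared, sigma-scaled deviations of pairwise distances.

    coords     -- list of (x, y, z) atom positions
    restraints -- list of (i, j, target_distance, sigma) tuples (hypothetical layout)
    """
    total = 0.0
    for i, j, target, sigma in restraints:
        distance = math.dist(coords[i], coords[j])
        total += ((distance - target) / sigma) ** 2
    return total


# Two atoms 5.0 units apart, restrained to 4.0 with a tolerance of 0.5.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
score = restraint_violation(coords, [(0, 1, 4.0, 0.5)])
```

An optimiser would move the coordinates so as to lower this score; a model that satisfies every restraint exactly scores zero.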
MODELLER was fully automated to calculate comparative models for a large number of protein sequences, using many different template structures and sequence-structure alignments [12, 16, 17]. Sequence-structure matches are established by aligning the SALIGN [20] sequence profile of the target sequence against each of the template sequences extracted from the PDB [21]. Significant alignments covering distinct regions of the target sequence are chosen for modeling. Models are calculated for each of the sequence-structure matches by using MODELLER [11]. The models consist of coordinates for all non-hydrogen atoms in the modeled part of a protein [16]. For each enzyme in the SKPDB, a total of 1000 models were generated, and the final models were selected on the basis of stereochemical quality and the MODELLER objective function. The final models were then evaluated by composite model quality criteria (see Analysis tools).
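The selection step can be sketched as follows. The text does not state how stereochemical quality and the objective function were combined, so this example assumes one plausible rule: keep the models that pass a stereochemical threshold, then take the lowest objective function value. The class `ModelScore`, the function `select_final_model`, the threshold of 0.90 and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    name: str                 # hypothetical model file name
    objective: float          # objective function value (lower is better)
    favoured_fraction: float  # fraction of residues in favoured Ramachandran regions


def select_final_model(models, min_favoured=0.90):
    """Pick the lowest-objective model among those passing the stereochemical filter.

    If no model passes the filter, fall back to the full set (an assumption).
    """
    passing = [m for m in models if m.favoured_fraction >= min_favoured]
    candidates = passing if passing else list(models)
    return min(candidates, key=lambda m: m.objective)


# Illustrative values only; a real run would hold 1000 entries per enzyme.
example = [
    ModelScore("model_0001.pdb", 1250.0, 0.93),
    ModelScore("model_0002.pdb", 1100.0, 0.85),
    ModelScore("model_0003.pdb", 1180.0, 0.95),
]
best = select_final_model(example)
```

Here `model_0002.pdb` has the lowest objective value but fails the stereochemical filter, so `model_0003.pdb` is selected.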
MODELLER was parallelised on a Beowulf cluster with 16 nodes (AMD Athlon 2100+; BioComp, São José do Rio Preto, SP, Brazil). The distribution of MODELLER jobs across the cluster is controlled with a C language library for the message passing interface (MPI), which allows parallel MODELLER execution and decreases the processing time of the modeling process. Figure 2 shows a schematic drawing of the large-scale automated comparative modeling used for SKPDB.
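How independent modeling jobs might be divided among 16 processes can be sketched in a few lines. The original implementation used a C MPI library and its scheduling scheme is not described, so this example assumes a simple static round-robin split and makes no MPI calls; `rank` and `size` mirror the usual MPI terms, and `jobs_for_rank` and the target names are hypothetical.

```python
def jobs_for_rank(jobs, rank, size):
    """Return the jobs assigned to process `rank` out of `size` processes (round-robin)."""
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return jobs[rank::size]


# 40 hypothetical modeling targets spread over 16 processes.
targets = [f"target_{i:03d}" for i in range(40)]
assignment = {rank: jobs_for_rank(targets, rank, 16) for rank in range(16)}
```

Each process would then run MODELLER only on its own slice of the target list, so no job is computed twice and none is skipped.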
Analysis tools
Difficult cases in homology modeling correspond to protein sequences that only possess distant homologues of known structure. In such cases, incorrect alignment can lead to regions of a model that have significant structural errors. The quality of the predicted model determines the information that can be extracted from it. Thus, estimating the accuracy of 3D protein models is essential for interpreting them. The model can be evaluated as a whole as well as in the individual regions [13].
The overall stereochemical quality of the final models was evaluated with the programs PROCHECK [22] and WHATCHECK. These programs were used to check bond lengths, bond angles, peptide bond and side-chain ring planarities, chirality, and main-chain and side-chain torsion angles. Another quality score used in the analysis of the structural models was the G-factor, computed by PROCHECK [22], which is essentially a log-odds score based on the observed distributions of the stereochemical parameters. The root mean square deviations (RMSD) from ideal geometries for bond lengths, bond angles, dihedrals and impropers were extracted for each model by using the program X-PLOR [23], and the program VERIFY-3D was used to measure the compatibility of a protein model with its sequence by using a 3D profile [24, 25]. These programs were used to assess the quality of the available models, and their results can be accessed by any user on the SKPDB web page for each enzyme.
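The RMSD from ideal geometry is the square root of the mean squared difference between observed and ideal values. The sketch below shows the arithmetic for bond lengths; it is not X-PLOR's implementation, the function name `rmsd_from_ideal` is hypothetical, and the numbers are illustrative values in ångströms rather than data from SKPDB.

```python
import math


def rmsd_from_ideal(observed, ideal):
    """Root mean square deviation between paired observed and ideal values."""
    if len(observed) != len(ideal) or not observed:
        raise ValueError("observed and ideal must be non-empty and of equal length")
    squared = [(o - i) ** 2 for o, i in zip(observed, ideal)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative bond lengths (in ångströms) for four bonds of a model.
observed_bonds = [1.530, 1.320, 1.240, 1.460]
ideal_bonds = [1.525, 1.329, 1.231, 1.458]
bond_rmsd = rmsd_from_ideal(observed_bonds, ideal_bonds)
```

The same function applies unchanged to bond angles, dihedrals and impropers, provided each observed value is paired with its ideal value in the same units.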
Web SKPDB platform
The SKPDB is implemented on Apache server 2 (http://apache.org/) with Fedora 9 (http://fedoraproject.org/) as the operating system. MySQL server version 4.0.20 (http://www.mysql.com) is used in SKPDB to store, retrieve and manage the data. All scripts for data querying and retrieval were written in Perl/CGI version 5.8.4 (http://www.perl.com/) and Java (http://java.sun.com). The web interfaces are written in HTML with some JavaScript, and they rely on Cascading Style Sheets (CSS) support. The modeled structures can be viewed with Jmol (http://jmol.sourceforge.net/), an open source software suite. Results are displayed in HTML format.
Data Source
All entries in SKPDB were sourced from the Swiss-Prot/UniprotKB [26] protein sequence database and the PDB [21] protein structure database. Initially, exhaustive queries were made to Swiss-Prot/UniprotKB, returning more than 10,000 enzymes of the shikimate pathway from different organisms. The enzyme data were then filtered to exclude redundancy, errors and incomplete data, and the remaining data were included in a single composite non-redundant database. The process of building SKPDB is shown in figure 3.
Currently, SKPDB has 8902 enzymes of the shikimate pathway from microorganisms and plants. The majority of the protein structures (5477 entries) were predicted by comparative modeling, though the database also includes 53 protein structures solved by crystallography and 3372 proteins without any structure solved/modeled.
SKPDB description
SKPDB is a relational database of protein structures predicted by comparative modeling or solved by crystallography, applied to the study of shikimate pathway enzymes. Each entry in SKPDB provides information about a given enzyme, including: (1) a detailed description of the enzyme, (2) the primary sequence of the enzyme, (3) the structure model of the enzyme, (4) the chemical properties of the enzyme, (5) references about the enzyme, and (6) comments and miscellaneous information. All files (primary sequence, atomic coordinates and quality values) are available for download. The database is available to all users on the Web and provides a large number of structural models for use in virtual screening initiatives and molecular docking.
The SKPDB is regularly updated with new data and tools for shikimate pathway enzymes. A click on the links opens a new window that displays more detailed information for the selected enzyme in different biological databases, such as Swiss-Prot/UniprotKB, PDB, KEGG, BRENDA, IUBMB and PubMed, among others. The enzyme records page contains the primary sequence and the structure of the model, information about the alignment, and analyses of the target models, such as PROCHECK results, G-factors and the RMSD values from ideal geometry.
Description table content in SKPDB
For data storage, a database was built that contains the following tables: sequence, model, template and analysis. The data were loaded into the tables through Perl scripts and language-specific modules (DBD::mysql) that allow interaction with the database, as shown in figure 4. The database is queried from an HTML client using Perl-CGI programming, which displays the records as dynamically generated web pages in different frames.
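A four-table layout of this kind can be sketched as follows. Only the table names (sequence, model, template and analysis) come from the text; every column, key and value below is an assumption made for illustration, and the example uses Python's built-in `sqlite3` module with an in-memory database in place of the MySQL server and Perl DBD::mysql scripts described above.

```python
import sqlite3

# Hypothetical columns; only the four table names are taken from the text.
SCHEMA = """
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL UNIQUE,
    residues    TEXT NOT NULL
);
CREATE TABLE template (
    template_id INTEGER PRIMARY KEY,
    pdb_code    TEXT NOT NULL
);
CREATE TABLE model (
    model_id         INTEGER PRIMARY KEY,
    sequence_id      INTEGER NOT NULL REFERENCES sequence(sequence_id),
    template_id      INTEGER NOT NULL REFERENCES template(template_id),
    coordinates_file TEXT NOT NULL
);
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    model_id    INTEGER NOT NULL REFERENCES model(model_id),
    g_factor    REAL,
    rmsd_bonds  REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder records, not real SKPDB entries.
conn.execute("INSERT INTO sequence VALUES (1, 'EXAMPLE1', 'MKT')")
conn.execute("INSERT INTO template VALUES (1, 'XXXX')")
conn.execute("INSERT INTO model VALUES (1, 1, 1, 'example1_model.pdb')")
conn.execute("INSERT INTO analysis VALUES (1, 1, -0.1, 0.005)")

# Join the four tables to assemble one enzyme record, as a results page might.
row = conn.execute(
    """
    SELECT s.accession, t.pdb_code, a.g_factor
    FROM model AS m
    JOIN sequence AS s ON s.sequence_id = m.sequence_id
    JOIN template AS t ON t.template_id = m.template_id
    JOIN analysis AS a ON a.model_id = m.model_id
    WHERE s.accession = ?
    """,
    ("EXAMPLE1",),
).fetchone()
```

Keeping templates and analyses in separate tables lets one sequence be linked to several models, and each model to its own quality values, without duplicating the sequence data.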