Software tools
MySQL Administrator Tool (Workbench 6.0 CE), Eclipse, Apache Tomcat 9 server, Java 1.8.
Data sources
The PDZ protein data and related information, such as sequence, organism, available structures, localization and related literature, were obtained from the publicly available UniProtKB/Swiss-Prot database (release 2017_20). As the largest up-to-date, manually curated protein database, UniProtKB/Swiss-Prot was the most important data source for PDZscape. All entries for PDZ domain-containing proteins were manually downloaded from UniProtKB as flat files, and the related information was extracted into a CSV (comma-separated values) file using Perl regular expressions. This CSV file was then used to create the tables of a MySQL database.
Additional identifiers for PDZscape proteins were taken from their respective databases: Gene IDs for gene information, PDB IDs for structural information [9], STRING IDs for protein-protein interaction information [10] and KEGG IDs for pathway information. The number and organization of PDZ domains in each protein are also recorded separately. Information on PDZ-interacting proteins was obtained from the IntAct database [11], one of the largest molecular interaction databases, which lists the reported interacting partners of a protein.
Sequences of all PDZ-containing proteins were compiled to form the source database of PDZ-BLAST (Basic Local Alignment Search Tool). Stand-alone BLAST (2.6.0+) [12, 13] was downloaded from the NCBI website and integrated into the database. The default BLAST scoring matrices [12, 13] were used for searching and homology scoring.
Known mutations were extracted from PMDB, and some were manually curated from the literature. The phenotypic implications associated with each mutation, if any, are also included in the PDZscape output. Extensive manual curation was also performed to obtain information on the association of PDZ-containing proteins with different pathological conditions.
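The flat-file extraction step can be illustrated with a minimal sketch. The authors used Perl regular expressions; the version below uses Java regular expressions instead, purely for illustration. The class name UniprotFlatFileToCsv, the choice of fields (accession, entry name, organism, length) and the CSV column order are assumptions, not the actual PDZscape extraction script. It relies only on the standard UniProtKB flat-file layout, in which each line starts with a two-letter code (ID, AC, OS, ...) and each entry ends with "//".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns UniProtKB flat-file lines into CSV rows.
class UniprotFlatFileToCsv {
    // ID line: entry name, review status, sequence length in amino acids.
    private static final Pattern ID_LINE =
            Pattern.compile("^ID\\s+(\\S+)\\s+\\w+;\\s+(\\d+) AA\\.");
    // AC line: the first accession listed is the primary one.
    private static final Pattern AC_LINE = Pattern.compile("^AC\\s+(\\w+);");
    // OS line: organism name, with an optional trailing period removed.
    private static final Pattern OS_LINE = Pattern.compile("^OS\\s+(.+?)\\.?$");

    static List<String> toCsvRows(List<String> lines) {
        List<String> rows = new ArrayList<>();
        String name = "", accession = "", length = "";
        StringBuilder organism = new StringBuilder();
        for (String line : lines) {
            Matcher id = ID_LINE.matcher(line);
            Matcher ac = AC_LINE.matcher(line);
            Matcher os = OS_LINE.matcher(line);
            if (id.find()) {
                name = id.group(1);
                length = id.group(2);
            } else if (ac.find() && accession.isEmpty()) {
                accession = ac.group(1);
            } else if (os.find()) {
                // The organism name may continue over several OS lines.
                if (organism.length() > 0) organism.append(' ');
                organism.append(os.group(1));
            } else if (line.startsWith("//")) {
                // End of entry: emit one CSV row and reset the fields.
                rows.add(String.join(",", accession, name,
                        quote(organism.toString()), length));
                name = ""; accession = ""; length = "";
                organism.setLength(0);
            }
        }
        return rows;
    }

    // Quote a CSV field, doubling any embedded quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        List<String> entry = Arrays.asList(
                "ID   DLG4_HUMAN              Reviewed;         724 AA.",
                "AC   P78352;",
                "OS   Homo sapiens (Human).",
                "//");
        toCsvRows(entry).forEach(System.out::println);
    }
}
```

A CSV file written this way could then be loaded into a MySQL table, one row per protein.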
Currently, more than 300 entries have been curated with known mutations and their association with various diseases. This information is available for human PDZ proteins and will be updated periodically. Manual curation was performed by searching the literature on PubMed and filtering papers for disease associations proven in experimental studies. Literature reports in which only mutations are described, without any information on associated diseases, have also been included in the database wherever relevant. Disease association in PDZ-containing proteins is not limited to mutations but also depends on other factors, including over-expression, chromosomal deletion and inhibition. Diseases linked to these factors are also reported in the database. The complete dataset covers all PDZ proteins and can be retrieved in ‘.csv’ format from the download page.
PBPFinder
To support reverse searches, that is, to find out whether a protein of interest is a PDZ-interacting partner, a simple tool named PBPFinder, based on sequence similarity and ID mapping, was developed in Java. It takes the UniProt ID or the sequence of a protein as input. PBPFinder searches both the database of known PDZ-binding proteins and the database of known PDZ-binding motifs. In both protein and peptide modes, it scans the given sequence for known PDZ-binding motifs, which are stored as regular expressions based on reports from the literature [14]. From these searches it reports whether the given protein is a known or a putative PDZ-binding protein.
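The motif scan can be sketched as follows. This is an illustration under stated assumptions, not the PBPFinder source: the class name PdzMotifScanner is hypothetical, and the three patterns are simplified forms of the commonly described C-terminal PDZ-binding motif classes, with an assumed set of hydrophobic residues. The actual expressions derived from [14] may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical scanner for C-terminal PDZ-binding motifs.
class PdzMotifScanner {
    // Illustrative motif classes; "$" anchors each pattern at the C-terminus.
    // The hydrophobic residue sets are simplifying assumptions.
    private static final Map<String, Pattern> MOTIFS = new LinkedHashMap<>();
    static {
        MOTIFS.put("Class I", Pattern.compile("[ST].[VILFMA]$"));
        MOTIFS.put("Class II", Pattern.compile("[VILFMAYW].[VILFMA]$"));
        MOTIFS.put("Class III", Pattern.compile("[DE].[VILFMA]$"));
    }

    // Returns the first matching motif class, or "none" if nothing matches.
    static String classify(String sequence) {
        String seq = sequence.trim().toUpperCase(Locale.ROOT);
        for (Map.Entry<String, Pattern> motif : MOTIFS.entrySet()) {
            if (motif.getValue().matcher(seq).find()) {
                return motif.getKey();
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        // Works for full-length proteins and short peptides alike,
        // since only the last three residues are inspected.
        for (String peptide : new String[] {"IESDV", "EYYV", "VDSV", "AAAG"}) {
            System.out.println(peptide + " -> " + classify(peptide));
        }
    }
}
```

Storing the motifs in an ordered map keeps the class labels next to their expressions, so new motifs from the literature could be added without changing the scanning logic.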
Database integration
The PDZscape database was developed in JavaScript using the Eclipse Juno software development environment. Programs for data integration and parsing were written in JavaScript and Perl. These programs were used to search and parse the data on PDZ proteins and their interacting partners from the flat files and to populate the MySQL tables. All MySQL queries to the database were implemented in a JavaScript page through a Java Database Connectivity (JDBC) connection and deployed on a Linux-based Apache Tomcat 9.0 server. The components of the data integration server were linked together using Eclipse, a multi-language integrated development environment (IDE) comprising a base workspace and an extensible plug-in system for customizing the environment, which is widely used for database and software development.
The database was constructed from various sources of PDZ-containing protein sequences, together with information on their structures and interacting partners. The compiled information includes not only well-established interacting partners but also putative ones, which may provide useful leads for further investigation. This wide range of data was combined to form a new and comprehensive knowledge base for PDZ proteins.
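A query issued over JDBC might look like the following minimal sketch. The table name pdz_protein and its columns uniprot_id, entry_name and organism are hypothetical, since the PDZscape schema is not described here; only the JDBC calls themselves (prepareStatement, setString, executeQuery) are standard. The caller is assumed to supply an open Connection to the MySQL server.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical data-access helper; table and column names are assumptions.
class PdzProteinQuery {
    // Parameterized query: the organism is bound at run time, not concatenated.
    static final String FIND_BY_ORGANISM =
            "SELECT uniprot_id, entry_name FROM pdz_protein "
            + "WHERE organism = ? ORDER BY uniprot_id";

    // Returns {uniprot_id, entry_name} pairs for the given organism.
    static List<String[]> findByOrganism(Connection conn, String organism)
            throws SQLException {
        List<String[]> hits = new ArrayList<>();
        try (PreparedStatement stmt = conn.prepareStatement(FIND_BY_ORGANISM)) {
            stmt.setString(1, organism);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    hits.add(new String[] {
                            rs.getString("uniprot_id"),
                            rs.getString("entry_name")});
                }
            }
        }
        return hits;
    }
}
```

Using a prepared statement with a bound parameter, rather than building the SQL string from user input, avoids SQL injection when the search term comes from a web form.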