Predicting PDZ domain mediated protein interactions from structure

Background
PDZ domains are structural protein domains that recognize simple linear amino acid motifs, often at protein C-termini, and mediate protein-protein interactions (PPIs) in important biological processes, such as ion channel regulation, cell polarity and neural development. PDZ domain-peptide interaction predictors have been developed based on domain and peptide sequence information. Since domain structure is known to influence binding specificity, we hypothesized that structural information could be used to predict interactions that sequence-based predictors do not find.

Results
We developed a novel computational predictor of PDZ domain and C-terminal peptide interactions using a support vector machine trained with PDZ domain structure and peptide sequence information. Performance was estimated using extensive cross validation testing. We used the structure-based predictor to scan the human proteome for ligands of 218 PDZ domains and show that the predictions correspond to known PDZ domain-peptide interactions and PPIs in curated databases. The structure-based predictor is complementary to the sequence-based predictor, finding unique known and novel PPIs, and is less dependent on training–testing domain sequence similarity. We used a functional enrichment analysis of our hits to create a predicted map of PDZ domain biology. This map highlights PDZ domain involvement in diverse biological processes, some only found by the structure-based predictor. Based on this analysis, we predict novel PDZ domain involvement in xenobiotic metabolism and suggest new interactions for other processes including wound healing and Wnt signalling.

Conclusions
We built a structure-based predictor of PDZ domain-peptide interactions, which can be used to scan C-terminal proteomes for PDZ interactions. We also show that the structure-based predictor finds many known PDZ-mediated PPIs in human that were not found by our previous sequence-based predictor and is less dependent on training–testing domain sequence similarity. Using both predictors, we defined a functional map of human PDZ domain biology and predict novel PDZ domain function. Users may access our structure-based and previous sequence-based predictors at http://webservice.baderlab.org/domains/POW.

A. Parameters for structure feature generation software

1. Solvent accessibility and hydrogen bonding properties
• Joy website [1]: http://tardis.nibio.go.jp/cgi-bin/joy/joy.cgi
• PDB files were uploaded and the resulting LaTeX output file was downloaded and parsed.

2. Electrostatics and hydrophobicity
• VASCo software: http://genome.tugraz.at/VASCo
• The software uses the program DelPhi [3,4] to compute the electrostatic potentials and HydroCalc to compute the hydrophobicity values. The default parameters for these programs, as distributed in the VASCo package, were used.
Both programs require the calculation of surface points, which was performed by the MSMS software [5]. For all programs, the default probe size of 1.4 Å was used.

B. Binding specificity similarity calculation
The distance between two PWMs a and b is the normalized Euclidean distance, where n is the number of columns in the PWM. This metric is normalized such that 0 represents perfectly similar PWMs and 1 represents perfectly dissimilar PWMs. The similarity between two PWMs is therefore 1 minus the distance.

[...] (solvent accessible area) [2], VASCo (electrostatics), VASCo (hydrophobicity) [9] and 3D Zernike descriptors (structure shape) [6]. Five predictors were each trained with all but one of the feature sets, and performance was measured for multiple cross validation strategies. For all strategies except leaving 12% of domains out, performance is comparable across all predictors. For the strategy that involved leaving sets of domains out, performance improves only if the 3D Zernike descriptors are not used. Therefore, the final domain structure feature encoding did not include these features.
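The PWM distance from section B can be sketched in a few lines. The exact formula is not reproduced in this text, so the version below is an assumption: each PWM column is treated as a probability distribution, the Euclidean distance is computed per column, and the sum is divided by sqrt(2) times n, since sqrt(2) is the largest possible distance between two probability vectors. The function names and the toy four-letter alphabet are illustrative only.

```python
import math

def pwm_distance(a, b):
    """Normalized Euclidean distance between two PWMs (assumed form).

    a, b: lists of n columns; each column is a list of probabilities
    that sums to 1 (an assumption, not stated in the text).
    """
    if len(a) != len(b):
        raise ValueError("PWMs must have the same number of columns")
    n = len(a)
    total = 0.0
    for col_a, col_b in zip(a, b):
        # Euclidean distance between the two columns
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(col_a, col_b)))
    # sqrt(2) is the maximum distance between two probability vectors,
    # so dividing by sqrt(2) * n maps the result onto [0, 1]
    return total / (math.sqrt(2) * n)

def pwm_similarity(a, b):
    """Similarity is 1 minus the distance, as described in section B."""
    return 1.0 - pwm_distance(a, b)

# Toy PWMs over a hypothetical 4-letter alphabet, two columns each
pwm_x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
pwm_y = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

With this normalization, identical PWMs have distance 0, and PWMs whose columns place all probability on different residues have distance 1, which matches the range described in section B.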

E. Semi-supervised negative training set expansion
An initial predictor was built using the data for 88 PDZ domains described above. A [...] (Figure S2, left boxplot). Since previous phage display experiments detected fewer than a hundred binders per domain among billions of random peptides, the majority of these initial predictions are likely false positives. We surmised that the initial negative training data did not adequately cover the negative proteomic interaction space. Therefore, we added negative interactions for the 24 human and 22 mouse domains for which the predictor returned 1000 or more hits.
We used a semi-supervised learning approach similar to a method previously used to expand negative training data sets when there are no negatives initially available [10]. For the human domains, an SVM was trained using the initial human training data. This SVM was then used to predict additional negative interactors by scanning a pool of unlabelled non-redundant C-terminal peptides obtained from the human proteome (in total 2522 peptides). The negative interactors were then sorted by decreasing decision value and 100 peptides were sampled. The same was repeated using the initial training data for mouse with the pool of non-redundant mouse proteomic C-terminal peptides (in total 2348 peptides). An SVM for both mouse and human was then trained using all the initial training data plus the additional predicted negative interactions.
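The selection step of this negative set expansion can be sketched as follows. This is a minimal illustration, not the authors' implementation: `decision_value` stands in for the decision function of the initial SVM, a peptide is treated as a predicted negative when its decision value is below zero, and the 100 peptides are drawn at evenly spaced ranks from the sorted list. The text does not state how the sampling was done, so the even spacing is an assumption.

```python
def select_negatives(peptides, decision_value, n_samples=100):
    """Pick additional negative peptides from an unlabelled pool.

    peptides: iterable of peptide strings (the unlabelled pool)
    decision_value: callable returning the SVM decision value for a
        peptide (hypothetical stand-in for the initial predictor)
    n_samples: number of negatives to sample (100 in the text)
    """
    scored = [(decision_value(p), p) for p in peptides]
    # Assumption: decision value < 0 means "predicted negative"
    negatives = [(s, p) for s, p in scored if s < 0]
    # Sort by decreasing decision value, as described in the text
    negatives.sort(key=lambda sp: sp[0], reverse=True)
    if len(negatives) <= n_samples:
        return [p for _, p in negatives]
    # Assumption: sample at evenly spaced ranks across the sorted list
    step = len(negatives) / n_samples
    return [negatives[int(i * step)][1] for i in range(n_samples)]

# Toy pool with made-up decision values
toy_scores = {"P%d" % i: i - 4.5 for i in range(10)}
toy_selected = select_negatives(toy_scores.keys(), toy_scores.get, n_samples=2)
```

Sampling across the whole ranked list, rather than taking only the most confident negatives, would give the retrained SVM negatives that lie both near and far from the decision boundary.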
We used this predictor to scan the human proteome for interactors of training domains

Figure: Number of Predictions vs Training Negatives Used
A peptide was considered genomic if its last four residues can be found in a proteomic tail; otherwise it was considered non-genomic [11]. Numbers in bold indicate which similarity (sequence or structure) is higher (i.e. which predicted logo is closer to the experimental logo).
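The genomic versus non-genomic distinction can be illustrated with a short sketch. It assumes that "found in a proteomic tail" means an exact match between the last four residues of the peptide and the last four residues of some protein in the proteome. The function names and example sequences are hypothetical.

```python
def build_tail_index(proteome_sequences, k=4):
    """Collect the last k residues of every protein sequence."""
    return {seq[-k:] for seq in proteome_sequences if len(seq) >= k}

def is_genomic(peptide, tail_index, k=4):
    """True if the peptide's last k residues match a proteomic tail.

    Assumption: an exact match on the C-terminal k-mer is required.
    """
    return len(peptide) >= k and peptide[-k:] in tail_index

# Made-up sequences for illustration only
toy_proteome = ["MKTAYIAKQRETSV", "MSDQEAKPSTEDLV"]
toy_tails = build_tail_index(toy_proteome)
```

Here `is_genomic("KQRETSV", toy_tails)` is true because `ETSV` ends the first toy sequence, while a peptide ending in `WWWW` would be classed as non-genomic.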

H. Structure based predictor blind testing performance
Blind testing was performed to obtain an unbiased measure of predictor performance and to determine if the predictor could correctly predict interactions in other organisms not represented in the training set (such as fly and worm). We used interaction data for 13 mouse, seven worm and six fly PDZ domains with interactions from previous protein microarray experiments which were not previously used for training [8] (Table S2).
Homology models were generated by SWISS-MODEL and have at least 40% sequence identity to their template structures and no binding site gaps. The average template sequence similarity was 92%, 61% and 61% for mouse, worm and fly domains, respectively. An NMR structure was available for one fly domain (PAR6-1) and the first model was used (1RY4 A). One mouse domain (CHAPSYN-110-1) was removed from the test set because its performance was consistently poor for both sequence-based and structure-based predictors (see Additional file 2, Table S2).
The blind test results show that the structure-based predictor is able to correctly predict many unseen interactions in fly, worm and mouse (Figure S4) and that its performance is similar to the sequence-based predictor (Table S2, Table S3).
The predictor correctly predicted interactions for all worm PDZ domains (six out of six) and for five out of seven fly PDZ domains. For these domains, approximately 46% and 54% of known interactions for worm and fly, respectively, were predicted (see Additional file 2, Table S6, Table S7). Using the negative interactions from the protein microarray experiments, we computed the false positive rate (FPR), which was on average 0.197 and 0.157 for worm and fly, respectively. These results suggest that the predictor is able to correctly predict both positive and negative PDZ domain-peptide interactions in different organisms.
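The two quantities reported above, the fraction of known interactions recovered and the FPR, can be computed per domain as in the sketch below. It uses the standard definitions (recovered fraction = true positives over all known positives, FPR = false positives over all known negatives). The averaging across domains mentioned in the text is not shown, and the peptide labels are made up.

```python
def recovery_and_fpr(predicted, known_positives, known_negatives):
    """Return (fraction of known positives predicted, false positive rate).

    predicted: peptides the predictor calls binders for one domain
    known_positives / known_negatives: experimentally labelled peptides
    """
    predicted = set(predicted)
    positives = set(known_positives)
    negatives = set(known_negatives)
    # Fraction of known interactions that were predicted
    recovered = len(predicted & positives) / len(positives) if positives else float("nan")
    # Fraction of known non-interactions that were wrongly predicted
    fpr = len(predicted & negatives) / len(negatives) if negatives else float("nan")
    return recovered, fpr

# Toy labels for a single hypothetical domain
toy_recovered, toy_fpr = recovery_and_fpr(
    predicted={"a", "b", "x"},
    known_positives={"a", "b", "c", "d"},
    known_negatives={"x", "y", "z", "w", "v"},
)
```

In this toy case two of four known binders are recovered and one of five known non-binders is predicted as a binder.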