Training and testing dataset
In an attempt to exclude proteins participating in obligate interactions, we generated our training set by selecting only proteins which can form stable structures in vivo without needing to be bound to other proteins ; crystal structures of obligate interactions should not satisfy this criteria. We only took proteins with structural deposits in the PDB as it is a necessary requirement for the analysis of structural feature that contribute to PPIs. Many intrinsically disordered proteins (IDPs), which lack stable tertiary structures, also participate in transient PPIs. The lack of PDB entries of IDPs limits their use in PPI prediction. The training set consisted of a collection of proteins having only one entry in the COMPND record of the PDB, representing proteins in an unbound state so they could be queried and trained in their native conformations. Protein files with a single COMPND entry most often contain multiple, nearly identical, chain entries representing either subunits of a homomultimer or similar conformations of the same protein included for completeness; these files represent the monomers in the training set.
To control the quality and variety of the monomers in the training set, proteins were filtered by properties in their PDB files. These included structure resolution (≤3.5 Å), number of chains (≤5), chain length (≥100 and ≤800), and chain length standard deviation (the difference between the multiple chains of the same compound listed in the PDB file was ≤5), ensuring proteins in the set were of high quality and regular size, as well as ensuring regular handling of chain heads and tails. The number of connections (≤1000) and number of disulphide bonds (≤100) were filtered to remove unusual proteins such as 2IWV, a monomeric porin laced with ligands, and 3V83, a transport protein with a high number of disulphide bridges. In addition, antigen-antibody complexes were removed and the presence of nucleotides in the protein structure was disallowed to restrict training samples to strictly protein-protein interactions. We also mandated that crystallographic B-factors be given for all atoms so that the values could be used directly as a feature by machine learning.
With the above criteria applied to a 2013 January snapshot of the PDB, the monomer dataset is reduced to 48,424 structures. The majority of protein exclusion was due to chain length restrictions, as many smaller entities in the PDB are not proteins at all but are instead fragments or pharmacological compounds. Lastly, proteins with more than 70% sequence identity, as determined by BLAST , were considered duplicates, and only one was kept. This was done to avoid the over-representation of proteins whose structure has been recorded many times, though some homologues may remain in the dataset. Of each remaining structure, the chain used to represent the monomer was the longest chain in the PDB file or the chain with the lowest sum of B-factors (should there be several chains of the same length). If no chains had a lower sum of B-factors or longest length, then the first chain in the file was used.
A protein complex was defined as a structure having more than one COMPND entry. To identify which monomers in the dataset have interaction partners, each monomer was aligned to all chains of all protein complexes in the PDB. A match was found if there was at least 80% sequence identity between one of the chains in the complex and the monomer. A geometric definition of a PPI was adopted for the identification of interacting residues using a distance cut-off between chains. For a residue to be considered part of an interacting site, it was required that there be at least one pair of atoms at a distance of at most 4.5 Å between two interacting chains.
Interface mapping successfully mapped 663 monomers to complexes. After PDB validation and feature calculations, 392 monomers remained, with many discarded proteins representing complexes without acceptable interfaces as judged by our program. An unacceptable interface site was most commonly found on a complex where there were no atoms within 4.5 Å between two “interacting” chains or a site that had less than 90% residue identity between the complex and monomer representations of the chain (usually due to a poorly resolved residue), which did not allow us to exactly map the interacting residues directly back to the monomer.
The monomer-complex interface mapping process described is effective for complexes stored directly in the pdb file, but ignores the biologically relevant multimeric form as predicted by the crystallographers (stored in the BIOMT field of the pdb files). For all proteins in the training set, the BIOMT data was analysed and interface mapping was performed, treating monomer files, as well as the complexes to which they were mapped, with the appropriate BIOMT information, thereby creating new interfaces for use in machine learning. Biologically, these are important to preserve, but they made no statistically significant difference to the performance of the predictor on the dataset as measured by cross validation. Most importantly, we aimed to maintain consistency with datasets used by other learners derived from data mining, and as such, these interface sites are not included in the below data, but data and results for most tests below is available in the supplemental material (Additional file 1: Table S5).
Improving data labelling
It must be addressed that a protein with a low percentage of observed interacting residues is likely not a result of lack of interactions, but rather a lack of complexes by which to identify the interacting residues of the protein. Thus, the use of proteins with unknown or unidentified interactions, which can be a problem even in manually curated datasets, leads to a systematic increase in type II error. As it is impossible to guarantee knowing every interface site for a given protein, an approach of assuming better identification of these sites in proteins with higher rates of identified interacting residues was adopted. We considered a protein as divisions of 100 residue domains, as suggested by the typical size of a domain , and assumed the accuracy of residue labelling was quite good if there was at least one interface site per domain (based on observations of protein domain sizes and interaction site distribution [33, 34]), and comparatively better if there were at least two interface sites per domain. This reduced the possibility of missing interfaces, though potentially not fully eliminating it. To thoroughly explore the effect of data labelling accuracy, two testing subsets were generated from the full training set: a set with proteins where there is at least one interface per 100 residues (NI1) and a set of proteins where there are at least two interfaces per 100 residues (NI2).
Compensating for class imbalance
In total, only a small fraction of all residues in the full training set were identified as interacting, creating a class imbalance problem in MLPIP where the number of non-interacting residues far outweighs the number of interacting ones, resulting in a high bias toward the majority class [35–37]. The issue was explored with random under-sampling, where instances of the majority class are removed until a target distribution of each class is attained. To find the optimal distribution, a set of proteins was repeatedly under-sampled to produce training sets with the proportion of interacting residues ranging from 30% to 70% of the training instances. These were then compiled, trained upon, and cross-validated for their true positive rates and true negative rates to find the point at which these rates met for a balanced prediction.
Machine learning features
Features used to predict interface sites were categorised as either residue features or residue-local features. To calculate residue-local features, a residue was considered with each of its neighbours, a set typically ranging from two to six residues on which calculations were performed over their collective surface area with the resulting value applied to the nodal residue. The use of residue-local features allows the calculation of the basic geometric attributes: protrusion, roughness, surface density, and curvature. The scope of these features is limited so that the integrity of the residue value is little affected by the values of residues further away on the protein. A baseline set of residue features was chosen from the pool of commonly-used features in MLPIP: these were relative solvent-excluded surface area (relSESA), B-factors, hydrophobicity, propensity, electrostatic potential, energy of solvation, conservation, evolutionary rate-shift, and disorder. For details on the calculations of each of these features, please refer to Additional file 2. To ensure proper citation of tools used to calculate some of these features, we note that MSMS  was used to calculate Solvent Excluded Surface Area as defined by Connolly , the mean maximum residue surface area was calculated using UCSF Chimera  (Additional file 3: Table S1); the CX algorithm  was used to calculate protrusion; roughness was calculated as a function of protein surfaces, as recommended by Pettit and Bowie ; the scale by Fauchére and Pliska  was used to calculate hydrophobicity; electrostatic potential was derived with APBS ; the curvature calculation was modelled on an approach by Coleman et al. ; rate shift was computed with the Rate4Site algorithm ; and conservation was given by the ScoreCons algorithm . These tools are commonly used in MLPIP studies and were here used to give values as similar as possible to previous efforts for individual attributes.
Statistical analysis of results
Prediction of interacting site was evaluated and measured using the following statistics:
Where TP is the number of true positives, TN is true negatives, FP is false positives, and FN is false negatives. The statistics calculated here are: true positive rate (TPr), also commonly referred to as sensitivity or recall, true negative rate (TNr), also known as specificity, precision (PRC), the Matthews correlation coefficient (MCC), and F measure (F1). Following LOOCV, mean MCC is given as an average score over a set of proteins. Pearson’s correlation between features and interaction site involvement was computed using MATLAB and evaluated using a two-tailed test of significance. Statistics of machine learning results were tested with the Wilcoxon signed-rank test using R, where the difference in score was ranked by absolute value and each rank was modified by the sign of the difference, i.e. whether the change was positive or negative after treatment. The modified ranks were then summed to generate the test statistic W, for which the sampling distribution approaches normality and a p-value can be calculated based on the z-score.