Dataset sources
Our initial training dataset was composed of X-ray determined protein structures obtained through an advanced search of the Protein Data Bank [20] that filtered the returned results by 30% sequence similarity and an X-ray resolution of 0–2 Å. We restricted the dataset to X-ray determined structures at this resolution to minimize variation between structures caused by mixing experimental methods with widely varying resolution levels. Filtering the sequences at 30% sequence similarity prevented duplicates from appearing in the data and kept the training data diverse. The next stage of filtering removed structures with more than 20% disordered residue content (more than 20% of the residues in the sequence were not represented in the experimentally determined list of atomic coordinates) and structures whose sequences were not within the range of 30 to 300 residues in length. These two steps were performed mainly to improve the quality of the proteins in our dataset and to avoid introducing proteins that were too large to process in a reasonable amount of time. Our testing dataset was composed of the target proteins in the CASP11 (The 11th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction) [21] dataset and was used to perform the final filtering on the training dataset: we used Clustal Omega [22] to remove homologous sequences from the training dataset if they had 25% or greater sequence similarity to any of the CASP11 proteins. Our final training dataset contained 1995 proteins after this filtering process. Instructions to reproduce this dataset can be found on the website for our RFcon web server.
Prediction programs
Our feature generation process began by taking the sequences of our training proteins and using them as input for three prediction programs. We used PSIPRED version 3.5 [23] to predict secondary structure and ACCpro version 5.1 (a member of the SCRATCH 1.0 package) [24] to predict solvent accessibility. PSIPRED’s predictions are given for each residue in the sequence and are encoded in our examples as three probability values representing the likelihood of that residue belonging to a β-sheet, an α-helix, or a coil. ACCpro’s predictions are encoded for each residue in our examples as a binary number that represents whether or not the residue is solvent accessible [25].
DCA_cpp is our C++ implementation of the direct coupling analysis methods described by Morcos et al. [26] and was tested both as a part of the training data and as an independent prediction method in our evaluations. HHblits [27] was used to generate a multiple sequence alignment (MSA) against the Hidden Markov Model (HMM) database of “uniprot20_2015_06”. We used this MSA to obtain a sequence profile of the input protein sequence and a contact map with predictions scored by confidence value. Here the only input for DCA_cpp was the amino acid sequence of the proteins in our datasets.
Feature generation process
We developed a Python script that parsed the structure file of each protein in our datasets and combined the output of the previously mentioned prediction programs to generate the features used to train and test our machine learning models. Each example in this feature set represented the interaction between a pair of residues centered within two sliding windows that each contained 11 residues. For the models trained with binary classification, the target value (class label) of each example was set to positive (in contact) if the three-dimensional Euclidean distance between the coordinates of the two residues’ α-carbons was less than or equal to 8 Å, and set to negative (not in contact) otherwise. For the models trained using regression, the target value t was defined by the following equation, where d is the Euclidean distance between the α-carbon atoms belonging to the residues at the center of each window and c is a constant (set to 3 in our calculations).
$$ t=\frac{1}{1+{\left(\frac{d}{c}\right)}^2} $$
(4)
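As an illustration of the two target definitions, the following minimal sketch computes both the binary label and the regression target of Eq. (4) from a pair of α-carbon coordinates; the function name and the use of NumPy are illustrative rather than part of our actual generation script.

```python
import numpy as np

CONTACT_CUTOFF = 8.0  # angstrom threshold for the binary (classification) label
C = 3.0               # constant c of Eq. (4)

def contact_targets(ca_i, ca_j, cutoff=CONTACT_CUTOFF, c=C):
    """Return (binary label, regression target) for two alpha-carbon coordinates."""
    d = np.linalg.norm(np.asarray(ca_i, dtype=float) - np.asarray(ca_j, dtype=float))
    label = 1 if d <= cutoff else -1           # positive = in contact
    t = 1.0 / (1.0 + (d / c) ** 2)             # Eq. (4): close to 1 for short distances
    return label, t

# Example: alpha-carbons 5.2 angstroms apart -> (1, ~0.25)
print(contact_targets((0.0, 0.0, 0.0), (5.2, 0.0, 0.0)))
```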
Sliding the windows along the sequence of the protein refers to the process of generating the feature data for each pair of residues used in the examples. During the parsing of each protein, the centers of the two windows were kept at least 6 residues apart in the sequence at all times while sliding along the sequence of the protein. Figure 2 depicts the residues involved in one possible example, with the first window (Fig. 2b) centered on the first residue of the protein and the second window (Fig. 2d) centered on the 62nd residue. Since the size of the window is static during feature generation, the first window (Fig. 2b) in this example extends past the boundary of the sequence and must be partially filled with “empty residues” (Fig. 2a). We encode these residues with the same number of features as regular residues; however, all of these values are set to 0 except for one feature (set to 1) that indicates the residue is not part of the sequence. Regardless of the actual distance between the two window centers, 50 “intermediate” residues are always included, even if duplicates are used (Fig. 2c). The sequence position of each residue a_n in this set is defined by the following formula, where n is an integer from 1 to 50, r is the sequence position of the residue at the first window center, and s is a scale factor calculated by dividing the total number of residues between the two window centers by 50.
$$ {a}_{n}=\left\lfloor r+sn\right\rfloor $$
(5)
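A short sketch of Eq. (5) follows, assuming the residue count between the two window centers is taken as the difference of their sequence positions; the names are illustrative.

```python
import math

NUM_INTERMEDIATE = 50  # the number of intermediate residues is fixed

def intermediate_positions(r, j):
    """Sequence positions of the 50 intermediate residues between the window
    centers at positions r and j (r < j), following Eq. (5)."""
    s = (j - r) / NUM_INTERMEDIATE                       # scale factor
    return [math.floor(r + s * n) for n in range(1, NUM_INTERMEDIATE + 1)]

# Window centers at residues 1 and 62, as in Fig. 2; positions are stretched or
# duplicated so that exactly 50 intermediate residues are always produced.
print(intermediate_positions(1, 62)[:5])   # [2, 3, 4, 5, 7]
```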
Description of features
Each of the 72 residues included in every example was represented by a set of 32 features composed of various types of data. Five features described by Atchley et al. [28] were used to account for various physicochemical and biological properties of the amino acids. The sequence profile generated by DCA_cpp was encoded with 20 features. Both the Atchley features and these sequence profile features provide quantitative information that describes the residue and its importance within the sequence. The remaining information used to describe each residue was represented with three features that encode PSIPRED’s secondary structure prediction, one feature that encodes ACCpro’s solvent accessibility prediction, two features that encode whether or not the residue lies within the boundaries of each of the two windows, and one feature that encodes whether or not the residue is an “empty residue”. An empty residue is simply a placeholder residue, used when a window extends past the boundary of the protein, whose features are all set to zero apart from the empty-residue indicator. Concatenating the 32 features for every one of the 72 residues drawn from the two sliding windows and the intermediate set described in the previous section accounted for 2304 features in each example. Then, 122 features were added to represent DCA_cpp’s contact prediction probabilities between the center of the first window and every residue up to the center of the second window. Thus, the total number of local features describing each data point is 2426.
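The 32-feature layout for a single residue can be summarized as in the sketch below; this is an illustrative reconstruction, not our generation script, and the argument names are hypothetical.

```python
def encode_residue(atchley, profile, ss_probs, solvent_acc,
                   in_window1, in_window2, is_empty=False):
    """Assemble the 32 features describing one residue.

    atchley      : 5 Atchley physicochemical/biological factors
    profile      : 20 sequence profile values
    ss_probs     : 3 PSIPRED probabilities (sheet, helix, coil)
    solvent_acc  : 1 binary ACCpro accessibility prediction
    in_window1/2 : 2 flags for membership in the two sliding windows
    is_empty     : 1 flag marking a placeholder residue outside the sequence
    """
    if is_empty:
        return [0.0] * 31 + [1.0]      # all zeros except the empty-residue flag
    features = (list(atchley) + list(profile) + list(ss_probs)
                + [float(solvent_acc), float(in_window1), float(in_window2), 0.0])
    assert len(features) == 32
    return features
```

Concatenating 72 such vectors gives the 2304 residue-level features of an example before the 122 DCA_cpp contact probabilities are appended.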
In addition to these local features, 63 global features describing aspects of the protein as a whole were added to each example. Unlike the local features, the global features describe the entire protein rather than one isolated example, so for every example generated from a given target protein the global features remain constant. First, 20 features were used to encode the amino acid composition of the protein; these features represent the frequency of occurrence of each amino acid in the sequence. The next 40 features encode the pseudo amino acid composition of the protein as described by Chou [29]. The final three features in this category encode the exposed secondary structure composition: the percentage of PSIPRED secondary structure predictions that were predicted to be exposed by ACCpro. These features may assist in contact prediction by helping the model directly consider the shape of the protein at points of possible contact. In total, each example pair of residues is described with 2489 features. Additional details about these features can be found in the supplementary information (see Additional file 1).
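The first 20 global features reduce to frequency counting over the sequence, as in the following sketch (the 40 pseudo amino acid composition features of Chou add sequence-order correlation terms and are omitted here).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard amino acids

def amino_acid_composition(sequence):
    """Frequency of occurrence of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    length = max(len(sequence), 1)
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]

print(amino_acid_composition("MKTAYIAKQR"))   # 20 frequencies summing to 1
```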
Repeating this feature generation process for every pair of residues in the set of training proteins produced a dataset of examples that was then split into three groups based on sequence separation (the difference between the sequence positions of the residues at the two window centers). Examples with a separation of 6–11 residues were placed into the “sep6” group (199,536 examples), those with a separation of 12–23 residues into the “sep12” group (245,914 examples), and those with a separation of 24 or more residues into the “sep24” group (743,600 examples). Since parsing the proteins produced far more negative examples than positive examples, we retained all positive examples from each target protein but randomly sampled an equal number of negative examples.
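The class balancing step amounts to keeping every positive example and drawing an equal-sized random subset of negatives; a minimal sketch, with illustrative names:

```python
import random

def balance_examples(examples, seed=0):
    """Keep all positive examples and an equal number of randomly sampled
    negatives; each example is a (features, label) pair with label +1 or -1."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == -1]
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + sampled
    rng.shuffle(balanced)
    return balanced
```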
Feature selection
The large number of features in our original training datasets required a large amount of storage space and caused training to run very slowly with the machine learning methods we tested. Therefore, we used the randomForest [30] library in R to train three RF models on three subsets of 10,000 randomly selected examples (maintaining the balance of positive and negative examples and the sequence separation categories) from the sep6, sep12, and sep24 datasets. These models were trained with the default “mtry” parameter, 1500 trees, and a “nodesize” of 10. The 100 features with the highest “mean decrease Gini” values were then selected, and three new datasets were created using only these 100 features from the full set of examples. This feature selection process was used to generate the features for all of the models described in the results section except for the “rf_full” models, which used the full feature set. We also used the same procedure to select features based on the “mean decrease accuracy” metric; these features were used to train a different set of models. More details about the difference between these two sets of selected features can be found in the supplementary information (see Additional file 1).
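Our feature selection was carried out with the randomForest package in R; for readers working in Python, the sketch below shows an analogous procedure using scikit-learn, whose impurity-based feature_importances_ corresponds to the mean decrease Gini ranking (min_samples_leaf is only a rough stand-in for nodesize).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_gini_features(X, y, n_features=100, n_trees=1500):
    """Rank features by impurity-based importance and return the column
    indices of the top n_features."""
    forest = RandomForestClassifier(n_estimators=n_trees,
                                    min_samples_leaf=10,
                                    n_jobs=-1, random_state=0)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:n_features]

# X_small, y_small: the 10,000-example subsample for one separation group.
# selected = top_gini_features(X_small, y_small)
# X_reduced = X_full[:, selected]
```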
Model training and optimization
We trained our SVM models (svm) by using the SVM_light implementation [31] of the SVM algorithm with our three feature selected training datasets. Our SDA models (sda_balanced, sda_unbalanced, and sda_Ensemble) were generated with a Theano implementation of the SDA algorithm that we have utilized in previous studies [32,33,34]. We trained our random forest models (rf_full and rf_selected) using the randomForest library in R [30]. The training process for all of these models was guided by observing the cross-validation accuracy during training and choosing new parameters until no further improvement could be made. Here, we describe the training process used to determine the final parameters and architecture for the models used in each method. In order to more accurately compare the differences in performance among our own methods, all of our final models were optimized by tuning various hyperparameters to achieve high performance on our training dataset. The effect of tuning these various parameters was closely monitored and validated using standard cross-validation techniques for every method except for random forest, which uses out-of-bag error estimation to achieve similar optimization results.
Support vector machine (svm)
Our final three SVM models were created by running SVM_light in classification mode using the linear kernel with the default parameters. Five-fold cross-validation was used to test the effects of different training parameters on the training dataset described above. We removed from each training fold all examples that came from proteins present in that fold’s respective test fold, and we maintained the equal balance of positive and negative contact examples in all of the folds. For optimization purposes, we experimented with many of the different kernels available in SVM_light. More information about tuning our SVM models can be found in the supplementary information (see Additional file 1), and the results of an optimization evaluation can be found in the supplementary data (see Additional file 2).
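Keeping proteins from leaking between training and test folds is equivalent to grouping examples by their source protein before splitting; the sketch below uses scikit-learn’s GroupKFold as an analogous approach rather than the exact procedure we used.

```python
from sklearn.model_selection import GroupKFold

def protein_grouped_folds(X, y, protein_ids, n_splits=5):
    """Yield (train_idx, test_idx) index pairs in which no protein contributes
    examples to both the training fold and its test fold."""
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(X, y, groups=protein_ids):
        yield train_idx, test_idx

# protein_ids holds one identifier per example (e.g. the PDB code of the source
# structure), so contacts from the same protein never cross the train/test split.
```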
Stacked denoising autoencoder (sda_balanced, sda_unbalanced, sda_Ensemble)
All of our final SDA-based models were trained with four-fold cross-validation by first splitting each of the three training datasets (the same datasets used in the SVM training procedure) into two equal halves and removing duplicate proteins. We reserved one half for training the final ensemble and used the other half for training the individual SDA models. We prepared the data for the “sda_balanced” models by splitting the reserved model training data into two subsets containing 25% and 75% of the reserved examples. These subsets were each split in half and duplicate proteins were removed. The final four folds were made by combining these subsets, using each exactly once for pretraining, training, validation, and testing by the SDA algorithm. We used these folds to perform cross-validation and optimize our models. More information about the various models that we tuned can be found in the supplementary information (see Additional file 1).
The final three “sda_balanced” models were sep6 (one hidden layer with 33 units), sep12 (10 hidden layers with 30 units each), and sep24 (one hidden layer with 33 units). The “sda_unbalanced” models were created by taking the four folds used for training the “sda_balanced” models and randomly downsampling the positive examples to a ratio of one positive example to every five negative examples. The final three “sda_unbalanced” models were sep6 (one hidden layer with 66 units), sep12 (five hidden layers each with 80 units), and sep24 (one hidden layer with 80 units and a second hidden layer with 40 units). All of the hidden layers in our SDA models shared a common corruption value of 0.1.
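Our SDA models were implemented in Theano; the following NumPy sketch only illustrates the masking corruption of 0.1 and the encode/decode structure of a single denoising autoencoder layer, with gradient-based training omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def denoising_layer(x, n_hidden=33, corruption=0.1):
    """One forward pass of a denoising autoencoder layer: zero a fraction of
    the inputs, encode to n_hidden units, and reconstruct with tied weights.
    Weights are random here; training updates are not shown."""
    n_visible = x.shape[0]
    W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
    b_hidden = np.zeros(n_hidden)
    b_visible = np.zeros(n_visible)
    mask = rng.random(n_visible) > corruption      # keep ~90% of the inputs
    hidden = sigmoid(W @ (x * mask) + b_hidden)    # encoding of the corrupted input
    reconstruction = sigmoid(W.T @ hidden + b_visible)
    return hidden, reconstruction

x = rng.random(100)                                # e.g. one example's selected features
h, r = denoising_layer(x, n_hidden=33, corruption=0.1)
```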
The “sda_Ensemble” models were trained by selecting a group of several of the “sda_unbalanced” models and combining them with an SVM model. This was done by selecting some of the highest performing models and using their predictions on the reserved data as the only input for training the final ensemble SVM models with linear regression. After verifying performance with five-fold cross-validation, the final ensemble models were sep6 (10 sda_unbalanced models, SVM -c parameter = 0.01), sep12 (10 sda_unbalanced models, default SVM parameters), and sep24 (5 sda_unbalanced models, default SVM parameters).
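These ensembles are a form of stacking: the base models’ predictions on the reserved half become the inputs of a linear SVM regressor. In our work this final model was trained with SVM_light; the sketch below uses scikit-learn’s LinearSVR only as an illustrative stand-in.

```python
from sklearn.svm import LinearSVR

def train_stacking_svm(base_predictions, targets, c=0.01):
    """Fit a linear SVM regressor on the stacked base-model predictions
    (one column per sda_unbalanced model) for the reserved ensemble data."""
    model = LinearSVR(C=c, max_iter=10000)
    model.fit(base_predictions, targets)
    return model

# base_predictions: array of shape (n_examples, n_base_models), e.g. 10 models
# for sep6; targets: the contact targets of the reserved half of the data.
```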
Random Forest (rf_full and rf_selected)
Our first three random forest models were trained on the datasets that had undergone feature selection. After trying various parameters and observing the effect on the OOB (Out Of Bag) estimate of the error on the training data, we used 1501 trees, a nodesize of 10, and the default value of mtry for all three of these “rf_selected” models. The last three models (“rf_full”) were trained on the full feature set before feature selection had been performed. The model trained on the full sep6 dataset used the same training parameters as the previously feature selected models. As a result of memory and computational time limitations, we chose different parameters for the sep12 and sep24 models. We chose to use 551 trees for these two models since the OOB error on the training set stopped noticeably decreasing at this point. The nodesize parameter was set to 10, and “mtry” was left at the default value.
Additional models
In order to more closely examine our features and their importance, we also trained all of the previously mentioned models using mean decrease accuracy feature selection rather than mean decrease Gini. We performed this feature selection using the same process and random forest models that generated the mean decrease Gini features. This allowed us to compare two sets of selected features for each sequence separation category and observe the effects of training models with them. In addition, we trained a selection of the original feature selection models (mean decrease Gini) using only 50% of the features.