Computational prediction of plasma protein binding of cyclic peptides from small molecule experimental data using sparse modeling techniques

Background Cyclic peptide-based drug discovery is attracting increasing interest owing to its potential to avoid target protein depletion. In drug discovery, it is important to maintain the biostability of a drug within the proper range. Plasma protein binding (PPB) is the most important index of biostability, and developing a computational method to predict PPB of drug candidate compounds contributes to the acceleration of drug discovery research. PPB prediction of small molecule drug compounds using machine learning has been conducted thus far; however, no study has investigated cyclic peptides because experimental information of cyclic peptides is scarce. Results First, we adopted sparse modeling and small molecule information to construct a PPB prediction model for cyclic peptides. As cyclic peptide data are limited, applying multidimensional nonlinear models involves concerns regarding overfitting. However, models constructed by sparse modeling can avoid overfitting, offering high generalization performance and interpretability. More than 1000 PPB data of small molecules are available, and we used them to construct a prediction models with two enumeration methods: enumerating lasso solutions (ELS) and forward beam search (FBS). The accuracies of the prediction models constructed by ELS and FBS were equal to or better than those of conventional non-linear models (MAE = 0.167–0.174) on cross-validation of a small molecule compound dataset. Moreover, we showed that the prediction accuracies for cyclic peptides were close to those for small molecule compounds (MAE = 0.194–0.288). Such high accuracy could not be obtained by a simple method of learning from cyclic peptide data directly by lasso regression (MAE = 0.286–0.671) or ridge regression (MAE = 0.244–0.354). Conclusion In this study, we proposed a machine learning techniques that uses low-dimensional sparse modeling to predict the PPB value of cyclic peptides computationally. The low-dimensional sparse model not only exhibits excellent generalization performance but also improves interpretation of the prediction model. This can provide common an noteworthy knowledge for future cyclic peptide drug discovery studies. Electronic supplementary material The online version of this article (10.1186/s12859-018-2529-z) contains supplementary material, which is available to authorized users.

Plasma protein binding (PPB), is the reversible binding of compounds to plasma proteins, and thus an equilibrium exists between bound and unbound forms. The fraction bound to plasma protein at equilibrium (f b ) is an important pharmacokinetic property [29] since PPB is strongly related to the absorption, distribution, metabolism, excretion, and toxicity of such compounds. In most cases, only unbound portions of the compounds can be distributed into tissues, which then interact with the target proteins and are finally excreted from the blood [30,31]. The candidate compounds that do not have appropriate PPB value are dropped in the later stages of drug discovery [32,33]; however, experimental measurements are expensive and time-consuming. Moreover, the dropout of candidate compounds in the later stage increases the development costs. Therefore, it is necessary to estimate the PPB values of candidate compounds computationally in the early stages and prioritize development strategies.
As for small molecules, there are some reports related to the development of computational PPB prediction methods. PPB prediction methods are roughly classified into docking-based methods [34] and machine learning methods [35][36][37][38]. In docking-based methods, the PPB value is predicted using the molecular docking score on the basis of the pose in which compounds are docked to the plasma protein. Lexa et al. docked compounds to two major binding sites of human serum albumin (HSA) [34]. They reported that the weighted combination of the predicted LogP and docking score most accurately distinguishes between high-PPB-value compounds and low-PPB-value compounds, with an AUC of 0.94, when evaluated against a "strict set." In machine learning methods, the model is trained on experimental PPB values of compounds, and the model predicts PPB values of unknown compounds. Previously, Ingle et al. used 1045 pharmaceutical data for model construction with support vector machines, k-nearest neighbors, and random forests and they evaluated these models against test data of 200 independent compounds and 406 environmentally relevant ToxCast chemicals [36]. They reported that the consensus model ensembled by these three non-linear models yielded mean absolute error (MAE) of 0.151-0.155 for pharmaceuticals and 0.110-0.131 for environmentally relevant chemicals. On the other hand, we found no reports on the development of PPB prediction methods for cyclic peptides. This implied the difficulty of predicting cyclic peptides due to paucity of experimental PPB data of cyclic peptides. It is also difficult to predict the binding poses of cyclic peptides owing to their large size and flexibility.
In general, it is considered that experimental PPB values of cyclic peptides is necessary to predict that of other cyclic peptides. However, as mentioned above, PPB data of cyclic peptides are insufficient, and it is very difficult to construct a prediction model with cyclic peptides. In practice, a prediction model trained on public cyclic peptide data is not very accurate, as discussed in the results section. Meanwhile, there are sufficient PPB data of small molecules. If small molecule data are informative for predicting cyclic peptides, machine learning method will work well. It is assumed that the physicochemical phenomena of PPB is same and important factors for explaining and governing PPB of both small molecules and cyclic peptides are universal. Feature selection, which is a task in machine learning, has the potential to extract such factors. Indeed, sparse modeling is a well-known method for feature selection. If these factors can be successfully extracted through feature selection, they could facilitate the construction of a model that predicts PPB values of cyclic peptides using PPB data of small molecules.
Here, we first propose the sparse model construction method to predict PPB values of cyclic peptides using small molecule data. The low-dimensional sparse model not only exhibits excellent generalization performance but also improves interpretation of the prediction model.

Datasets
We used three types of datasets: small molecules, FDAapproved cyclic peptides, and cyclic peptides from inhouse experiments. All molecules are available in the SMILES format with the fraction bound to plasma protein (f b ) for PPB value listed in Additional file 1: Supplementary Table S1. f b is a real number between 0 and 1. For some molecules, the f b value is determined as not a specific value but a range [f bmin , f bmax ]. We calculated the averaged value obtained by (f bmin + f bmax )/2 and used it as f b of the molecule.
The PPB values were converted into pseudo-equilibrium constant parameters (ln K a ) for model construction, as there is a greater need for the resolution of higher f b values (f b > 0.8) than for that of moderate f b values (f b ≈ 0.5). The transformation equation is given by where C is a constant set to 0.3 as in a previous study [36]. The results of the ln K a predictions were converted back to f b for assessment of model accuracy according to a previous study [36]. To prevent divergence of the ln K a value, f b was scaled (f b × 0.99 + 0.005) as in [37].

Small molecule dataset
We used pharmaceuticals with experimental f b values originally corrected by Ingle et al. [36]. The training data and test data were split exactly as in [36]. We used 1017 out of 1045 training compounds and 194 out of 200 test compounds by removing compounds that could not calculate a part of molecular descriptors owing to failure of conformation generation. The former is the small molecule training data and the latter is the small molecule test data.

Public cyclic peptide drugs dataset
There are 24 cyclic peptides with PPB assay experimental results in DrugBank [39] (accessed November 6, 2017), which is a public database of FDA-approved drugs.

Original synthetic cyclic peptides dataset
As the number of publicly available data of cyclic peptide drugs is small compared to that of small molecule, we additionally designed and experimented with 16 cyclic peptides composed exclusively of natural amino acids. The synthetic cyclic peptide sequences are listed in Table 1. First, linear peptides were synthesized. Then, circularization was achieved by making a disulfide bond between N-terminal and C-terminal cysteine residues and confirmed by TOF/MS and HPLC analyses.
Human PPB values f b were determined by the equilibrium dialysis method [40]. Frozen human plasma was thawed immediately at room temperature. Then, the plasma was centrifuged at 3220 g for 10 min to remove clots and the supernatant was collected into a fresh tube. The working solutions of test compounds were prepared in DMSO at a concentration of 200 μM. Then, 3 μL of the working solution was removed for mixing with 597 μL of human plasma to achieve a final concentration of 1 μM (0.5% DMSO). The plasma samples were vortexed thoroughly. The dialysis membranes (HTD 96a/b Dialysis Membrane Strips MWCO 12-14 K, Cat. #1101, Batch# 1141 (12-17)) were soaked in ultrapure water for 60 min to separate the strips, then in 20% ethanol for 20 min, and finally in the dialysis buffer (100 mM sodium phosphate and 150 mM NaCl) for 20 min. The dialysis apparatus was assembled according to the manufacturer's instructions. Each cell was filled with the spiked plasma sample and dialyzed against equal volume of the dialysis buffer. The assay was performed in duplicate. The dialysis plate was sealed and incubated in an incubator at 37°C with 5% CO 2 at 100 rpm for 6 h. At the end of incubation, the seal was removed and 50 μL of samples from both buffer and plasma chambers were transferred to wells of a 96-well plate. 50 μL of blank plasma was added to each buffer sample and an equal volume of phosphate buffered saline was supplemented to the collected plasma sample. 300 μL of room temperature quench solution (acetonitrile containing internal standards (IS, 100 nM Alprazolam, 500 nM Labetalol and 2 μM Ketoprofen)) was added to precipitate protein. Samples in the plate were vortexed for 5 min and centrifuged at 3220 g for 30 min at 4°C. Then, the supernatant was transferred to a new 96-well plate with 100 μL or 200 μL water (depending on the LC-MS signal response and peak shape) for LC-MS/MS analysis.

Molecular descriptors
To characterize molecules, we calculated 2D descriptors of the compounds and 3D descriptors of conformers of the compounds. The descriptors were calculated using molecular_descriptors.py and QikProp provided by Schrödinger, LLC [41]; there are 281 descriptors in total.
As conformers are required to generate 3D descriptors of compounds, the most stable conformation of each compound was generated from SMILES by LigPrep (Schrödinger, LLC) [42]. The descriptors consist of physical properties (e.g., LogS, LogP, and ASA) and topological descriptors based on the graphical representation of the molecules. All the descriptors were standardized to mean μ = 0 and variance σ 2 = 1 with reference to small molecule training data.

Enumeration of extracted descriptor sets
It is important to extract better descriptors in terms of robustness and interpretation of the prediction model. The biophysical basis of PPBs must be the same for small molecules and cyclic peptides. However, in this case, a result of feature selection strongly depends on the training data (small molecules). In other words, descriptors specific to small molecules (i.e., those that cannot represent cyclic peptides) will be chosen. Thus, we present multiple results of feature selection. Enumerating lasso solutions (ELS) and forward beam search (FBS) were used as feature selection methods, and the generated models were compared to baseline models trained on all descriptors. Feature selection was performed using the small molecule dataset, followed by application of the extracted subsets of descriptors to two cyclic peptide datasets. This process is needed because there are insufficient cyclic peptide data for extracting and verifying the descriptors. An outline of the feature selection and sparse modeling is shown in Fig. 1(a).

Enumerating lasso solutions
Hara and Maehara [43] proposed a sparse modeling algorithm for enumerating solutions to the lasso regression problem (least-squares method with L 1 regularization). This enumerating lasso solutions (ELS) algorithm is summarized in Algorithm 1, where n is the number of dimensions of a feature space, P = {1, 2, …, n} is a set representing indices of all the features, Lasso(S) is the function that calculates the lasso solution that allows coefficients of features included in S ⊆ P to be non-zero, and supp(β) is a set of features with non-zero coefficients in the enumerated lasso solution β. This algorithm calculates a lasso solution different from β by removing the features of supp(β) one by one from S. The output candidate is added as a tuple (β, S, F) to the priority queue for the objective function value of the lasso. F is a set of features removed from P. The weight parameter of the L 1 regularization term is related to the sparseness of the lasso solution. The algorithm can output multiple results of feature selection, whereas ordinary lasso regression outputs only a single result. a b Fig. 1 Outline of the construction of the prediction models for PPB in this study. a The flow of feature selection and model construction using enumerating lasso solutions (ELS) and forward beam search (FBS). b The flow of evaluation for prediction performance on baseline models and proposed models

Forward beam search
We also used forward beam search (FBS), which can output multiple results by applying beam search to the forward-stepwise extraction. A ridge regression model (least-squares method with L 2 regularization) was used for each descriptor set. The residual sum of squares was used for the loss of each model. When the search was completed, the best results of the descriptor sets were output in ascending order of loss. This FBS algorithm is summarized in Algorithm 2. P = {1, 2, …, n} is a set representing indices of all the features. Ridge(S) is the function that calculates the ridge solution.

Parameters of ELS and FBS
To balance the prediction accuracy and interpretability, we selected the number of descriptors in an extracted set to be around 5. For ELS, the weight parameter of the L 1 regularization term was set to 0.13 so that the number of features with non-zero coefficients in lasso's global optimal solution was 5. The search depth of FBS was set to D = 5 so that each extracted set has just 5 descriptors. In addition, the weight parameter of the L 2 regularization term in FBS was set to 1.0. Furthermore, the number of enumerations of each method was set to K = 200, and the beam width of FBS was set to W = 300, which are determined for the following reasons. In the general tendency, a larger value of K increases the possibility of finding a better model. The number of descriptors we selected is around 4 to 5, and K needs to be set between 100 and 300. In this local range, selected features were the same. In other words, the prediction accuracy was the same; thus, K was set to 200. Similarly, a larger value of W will also result in a better model because the search range becomes wider. However, in the range of W = 200-400, the prediction accuracy was the same; thus, W was set to 300.

Model evaluation
The models were evaluated and compared based on the root mean squared error (RMSE) of the predicted f b , the mean absolute error (MAE) of the predicted f b , and the correlation coefficient (R) of ln K a . These metrics are defined in the equations below: where N is the number of data, i is the ln K a value converted from f Ã b;i , and ln K a and ln K a Ã are the mean values of (ln K a ) i and ð ln K a Þ Ã i , respectively.

Model construction Baseline models
To verify the effectiveness of the proposed method, four baseline models were constructed (white boxes in Fig. 1(b)). These models were trained on two different types of datasets (small molecule training dataset or cyclic peptide drug dataset), and baseline performances were obtained by predicting three datasets: small molecule test dataset (SM), cyclic peptide drug dataset (CP), and synthetic cyclic peptide dataset (SCP). Detailed schemes are described in Fig. 1(b).
The ridge model with all features and lasso model with five features were used to construct baseline models. Hereafter, we refer to their baseline models "regressor name-training data" (e.g., ridge-SM). In baseline #1 (ridge-SM) and baseline #2 (lasso-SM), the models were trained on the small molecule training dataset, and predicted three datasets. In baseline #3 (ridge-CP) and baseline #4 (lasso-CP), PPB values of the cyclic peptide drug dataset and the synthetic cyclic peptide dataset were predicted. When predicting cyclic peptide drug dataset by ridge-CP and lasso-CP, leave-one-out cross-validation (LOOCV) was also conducted.  Table 2 PPB prediction results of small molecules for sparse modeling by ELS and FBS compared to the baseline results. These situations are shown in a part of Fig. 1

Proposed models
To compare the two feature selection methods, ridge regression models were generated for all extraction results obtained using the small molecule training data. The L 2 regularization weight parameter was adjusted on the basis of 3-fold cross-validation with small molecule training data for each result of the descriptor subsets. This cross-validation is only used to select model parameters during training. The models having the smallest RMSE of test data from each of the two methods of feature selection were selected as the proposed models because it was assumed that the model of best prediction of unknown data explains the PPB. Under the same conditions as those for ridge-SM and lasso-SM, the prediction accuracies of the proposed models were also evaluated.

Distribution of the datasets
To visualize the datasets, a scatter plot with molecular weight (MW) and octanol/water partition coefficient (QPLogPo/w) is shown in Fig. 2. It was found that most of the cyclic peptides are larger than the small molecules. The distribution of the synthetic cyclic peptides was slightly similar to that of the small molecule dataset, compared to that of the cyclic peptide drug dataset.

Small molecules PPB modeling
Hereafter, we present the prediction results in each situation described in Fig. 1(b). The results of predicting the f b values of small molecule dataset are listed in Table 2, and this situation is the same as that in the work of Ingle et al. The prediction accuracy of ridge-SM and lasso-SM is similar to that in the work of Ingle et al. As for the proposed ELS and FBS, it was found that the prediction accuracy is as good as that of ridge-SM and lasso-SM. Figure 3 shows the boxplots of each RMSE in f b of the small molecule test data with 200 different models. Although the prediction accuracy of both ELS and FBS varied according to the selected descriptors, the variation in the RMSE of ELS and FBS was around 0.02 and 0.01, respectively. Figure 4 shows a scatter plot of the PPB prediction results for ELS and FBS. The correlation coefficients between experimental f b and estimated f b in ELS and FBS were 0.714 and 0.725, respectively.

Simple ridge and lasso regression for cyclic peptides PPB directly
The prediction results for ridge-CP and lasso-CP are listed in Table 3. The prediction accuracy degraded compared to that of ridge-SM and lasso-SM, as the number of the data is much smaller than that of the small molecule dataset. Internal validation with the cyclic peptide drug dataset was carried out in LOOCV  Table 3) and the results were allowable (MAE = 0.244-0.286). For the prediction model constructed based on the cyclic peptide drug dataset, however, the prediction model could not predict for the synthetic cyclic peptide dataset at all. In this case, the MAE of ridge-CP and lasso-CP was 0.354 and 0.627, respectively. Originally, the number of cyclic peptide samples was extremely small. This implies that using cyclic peptides as training data is inappropriate unless the number of data is increased.
Prediction for cyclic peptide drugs with small molecules using sparse modeling The results of predicting PPB values of cyclic peptide drugs using the models constructed with the small molecule training data are listed in Table 4. In particular, ridge regression prediction ridge-SM failed (MAE = 0.442), indicating the effectiveness of sparse modeling. Among the constructed models, ELS predicted PPB values of cyclic peptide drugs more accurately than other methods (MAE = 0.216). The random prediction by output f b from uniform distribution was compared with ELS and FBS. The cyclic peptide dataset is predicted 10,000 times through random prediction. The average MAE is 0.374. Compared with this value, ELS (MAE = 0.216, one-sided P-value = 0.0013 with unpaired t-test) and FBS (MAE = 0.288, one-sided P-value = 0.044 with unpaired t-test) are significantly better. Figure 5 shows a boxplot of RMSE of f b when predicting ln K a of cyclic peptide drug data in all models. Although the RMSEs of the worst model in ELS and FBS were similar, the RMSE of the best model in ELS was less than that of FBS.

Prediction for synthetic cyclic peptides with small molecules using sparse modeling
The results of predicting the synthetic cyclic peptide dataset using the models trained on the small molecule training data are listed in Table 5. As already seen in Fig. 2, the spatial distribution of the synthetic cyclic peptide dataset overlaps with that of the small molecule dataset; thus, this prediction result also shows good accuracy. Ridge-SM, lasso-SM, and ELS showed nearly equivalent accuracy, and FBS was particularly accurate (MAE = 0.194). The prediction accuracies of ridge-CP and lasso-CP were worse. Therefore it is reasonable to use a small molecule dataset with Table 3 Prediction results for ordinary lasso sparse modeling and ridge regression as baseline results under the several situations shown in a part of Fig. 1 Table 4 PPB prediction results of cyclic peptide drugs for sparse modeling by ELS and FBS compared to the baseline results. These situations are shown in a part of Fig. 1(b). The values with asterisk represent the best prediction performance in each evaluation criterion, and ridge-CP-LOO and lasso-CP-LOO lines are reproduced from Table 3 Method Training set Test set abundant data for predicting cyclic peptide PPB. Figure 6 shows a scatter plot of predicted f b and experimental f b of the two cyclic peptide datasets in each model with the smallest RMSE.

Comparison of ELS and FBS
Interestingly, the best model of ELS outperformed that of FBS in terms of prediction of cyclic peptides (Table 4), unlike the results of prediction with small molecules (Table 2) and the original synthetic cyclic peptides (Table 5). We analyzed the prediction accuracies in detail. The relationship between the prediction error of small molecule training data and that of small molecule test data is shown in Fig. 7. According to the figure, the averaged prediction accuracy for small molecule test data is nearly the same as that for training data in ELS, whereas the averaged prediction accuracy for small molecule test data is worse than that for the training data in FBS. This means that the models based on FBS are more overfitted to the training data. Thus, we concluded that models of ELS are more robust in predicting diverse molecules such as cyclic peptides, which have properties different from those of training data (small molecules). ELS can generate models having high generalization ability. This is important when predicting for cyclic peptides with the model trained on small molecules. Interestingly, the prediction tendencies of the best models of ELS and FBS are different. To compare the bias of prediction in cyclic peptide drug dataset and synthetic cyclic peptide dataset ( Fig. 6(a) and 6 3Þ are calculated by counting samples, where f b represents the predicted value and f Ã b represents the experimental value. In ELS, Prð These values seem to indicate that ELS often predicts lower f b than experimental f b but FBS exhibits an opposite tendency to that of ELS.

Feature set and prediction accuracy of the most predictable model
The extracted descriptors by our best models for small molecule test data of the two feature selection methods are summarized in Tables 6 and 7. The physical descriptors were compared with those extracted in the previous study by Ingle et al. [36]. Interestingly, most of the physical descriptors, such as the charge descriptor (PEOE), the surface descriptor (SASA, PISA), and the partition coefficient (LogPo/w, QPLogPo/w), are consistent; hence, we confirmed that  Table 5 PPB prediction results of synthetic cyclic peptides for sparse modeling by ELS and FBS compared to the baseline results. These situations are shown in a part of Fig. 1(b). The values with asterisk represent the best prediction performance in each evaluation criterion, and ridge-CP and lasso-CP lines are reproduced from Table 3 Method Training set Test set

PCA analysis with extracted descriptors
The distances between the molecules in the linear models can be estimated with the distance of the scatter plot of principal component analysis (PCA). Thus, we applied PCA for all molecules with ELS extracted descriptors that performed the best for cyclic peptide drug data. Figure 8 shows scatter plots of first and second principal components (PC1, PC2) of ELS extracted descriptors. Figure 8(a) shows small molecules and cyclic peptide drugs in the same PC space. Figure 8(b) shows only cyclic peptide drugs for the readers' benefit. Both small molecules and cyclic peptide drugs tend to have low f b when PC1 and PC2 are smaller and high f b when PC1 and PC2 are larger. Therefore, these features are considered to be similar explanations for PPB in both small molecule compounds and cyclic peptides. Although the region in which cyclic peptides are plotted is biased compared to small molecule compounds, this feature set may partially represent the important structure of cyclic peptides.
Good and bad cases for prediction Figure 9 shows four cyclic peptides as good and bad prediction cases. Oritavancin ( Fig. 9(a)) is typical good case (experimental f b is 0.85 and estimated f b is obtained as 0.84 by ELS). Pep.1 of the synthetic cyclic peptides ( Fig. 9(b)) is also a good case (experimental f b is 0.24 and estimated f b is obtained as 0.18 by ELS). Acetyl-daptomycin succeeded in predicting PPBs, as shown in Fig. 9(c). Daptomycin, shown in Fig. 9  is important. The PCA plot based on the selected descriptors showed that the distance between the points corresponding to daptomycin and acetyl-daptomycin was not large (Fig. 8(b)). Thus, it can be assumed that extracted descriptors do not evaluate the effect of the alkyl chain sufficiently well. We need some new descriptors that express a local or partial structure to predict the PPB value of cyclic peptides. Toward this end, it could be assumed that taking particular note of residues that are most likely to bind to HSA or other plasma proteins is a reasonable approach. We need to define "residues that are most likely to bind" and calculate descriptors covering these residues and surrounding structures. The difference in local alkyl chain was important for PPB in the case of Fig. 9. Xu index appeared in FBS model (Table 7) is known as one of topological descriptors for the molecular backbone [45,46]. Therefore, it may be possible to extract the difference well by evaluating it for the local structure. The correlation coefficient between f b and Xu index is 0.452. This medium correlation perhaps show that Xu index is important for providing an explanation of PPB.
However, the correlation between molecular weight and Xu index is very high (coefficient = 0.990). It is important to propose a novel descriptor that not merely reflects the total weight like Xu index but can express local structural difference. Accordingly, it might be possible to generate descriptors that focus on the local or partial structure of cyclic peptides and stand for the binding between plasma proteins and cyclic peptides.

Conclusions
This study aimed to predict the fraction bound to plasma proteins of cyclic peptides by using sparse modeling techniques in machine learning. Enumeration methods were utilized to predict PPB values of cyclic peptides with the model trained on experimental PPB data of small molecules. Two enumeration methods, enumerating lasso solutions (ELS) and forward beam search (FBS), were compared to four baseline models constructed using ridge and ordinal lasso regressions. Their prediction accuracies were evaluated with two cyclic peptide datasets: public cyclic peptide drugs obtained from DrugBank database and original cyclic peptides synthesized in this study, and proposed models showed better performance than baseline models (ELS obtained MAE value of 0.216 and 0.269 in cyclic peptide drugs and synthetic cyclic peptides, respectively, FBS obtained MAE value of 0.288 and 0.194 in cyclic peptide drugs and synthetic cyclic peptides, respectively). The prediction model constructed by the sparse modeling techniques, ELS and FBS, well achieved the aim of this study; that is, predicting PPB value of cyclic peptides.
The directions for future work are as follows. It is known that the local structure is important for predicting PPB of cyclic peptides, but the descriptor set of the most accurate model can only partly represent this structure.