Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition
© Huang et al.; licensee BioMed Central Ltd. 2012
Published: 13 December 2012
Existing methods for predicting protein solubility on overexpression in Escherichia coli improve performance by using ensemble classifiers, such as two-stage support vector machine (SVM) based classifiers, and a number of feature types, such as physicochemical properties and amino acid and dipeptide composition, together with feature selection. A simple and easily interpretable method for predicting protein solubility is desirable as an alternative to these complex SVM-based methods.
This study proposes a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores of sequences for predicting protein solubility. SCM calculates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between the soluble and insoluble proteins of a training data set. Subsequently, the propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm. The solubility score of a sequence is determined by the weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons, four data sets with different sizes and degrees of variation in experimental conditions were used. The results show that the simple method SCM, with interpretable propensities of dipeptides, has promising performance compared with existing SVM-based ensemble methods that use a number of feature types. Furthermore, the propensities of dipeptides and the solubility scores of sequences can provide insights into protein solubility. For example, the analysis of dipeptide scores shows a high propensity of α-helix structure and thermophilic proteins to be soluble.
The propensities of individual dipeptides to be soluble vary for proteins under different experimental conditions. To predict protein solubility accurately using SCM, it is better to customize the score card of dipeptide propensities by using a training data set obtained under the same specified experimental conditions. The proposed method SCM, with its solubility scores and dipeptide propensities, can be easily applied to protein function prediction problems in which dipeptide composition features play an important role.
The datasets used, the source code of SCM, and supplementary files are available at http://iclab.life.nctu.edu.tw/SCM/.
Many proteins are produced as insoluble aggregates, which is a major obstacle for many experiments; such misfolded aggregates are called inclusion bodies. Many proteins form inclusion bodies when overexpressed in Escherichia coli (E. coli). These insoluble proteins need to be solubilized and refolded to obtain functional proteins. Protein solubility, defined as the concentration of soluble protein, varies widely, ranging from almost complete insolubility to several hundred milligrams per milliliter under given experimental conditions of pH, temperature, buffer concentration, and additives.
Protein solubility is a major concern in biochemical experiments. Accordingly, researchers usually make every effort to obtain the soluble forms of proteins by adjusting experimental conditions, including culture temperature, co-expression with solubility-enhancing proteins, efficient vectors, and host strains. All such adjustments of experimental conditions to obtain soluble proteins are still trial-and-error procedures. There is a significant need for highly consistent and accurate methods for predicting the solubility of proteins from their sequences.
Because various extrinsic and intrinsic factors influence protein solubility, it is difficult to develop an accurate and universal prediction method for estimating protein solubility and its change upon point mutation. Generally, computational sequence-based prediction methods focus on the intrinsic determinants of solubility for proteins overexpressed in E. coli at the normal growth temperature of 37°C. Numerous studies have investigated features that correlate well with solubility in order to design accurate prediction algorithms.
Many studies show that the amino acid sequence plays a crucial role in determining the solubility of expressed proteins. This is confirmed by experiments in which point mutations in an expressed protein sequence changed the solubility status under the same experimental conditions [1, 4–6]. The primary structure is therefore clearly related to the propensity of a protein to form inclusion bodies.
Many researchers predict the solubility of proteins expressed in E. coli from their primary structures. The first predictive model, based on regression analysis, used a database of 81 proteins and 6 parameters: turn forming residue fraction, charge average, cysteine fraction, hydrophilicity index, proline fraction, and molecular weight. Davis et al. found that only two of these six parameters, the turn forming residues and the charge average, influenced the solubility of overexpressed proteins in E. coli. Idicula-Thomas and Balaji adopted a discriminant analysis using 170 proteins and found that the most important parameters are threonine, asparagine and tyrosine fraction, aliphatic index, and tripeptide and dipeptide composition. Idicula-Thomas et al. proposed a support vector machine (SVM) based learning algorithm to predict protein solubility by evaluating three feature sets. The best accuracy of 72% was obtained using a set of 446 features, consisting of 20 reduced alphabet sets, 6 physicochemical properties, 20 residues, and 400 dipeptides; adding 8000 tripeptide-composition features did not improve prediction accuracy.
Smialowski et al. established a large dataset and proposed a two-layered predictor PROSO combining SVM and Naive Bayes classifiers. Magnan et al. used a huge dataset of 17,408 protein sequences and developed a two-stage SVM classifier using SVM and Naive Bayes classifiers. Diaz et al. employed logistic regression with 32 features that potentially correlate well with solubility and established a dataset of 212 proteins whose solubility status is confirmed by biological experiments. Chan et al. predicted the solubility of expressed proteins using SVM with an accuracy of 83.51%, where the dataset of 726 protein sequences is the combination of 6 different fusion tags and 121 target proteins. Smialowski et al. proposed a two-layered method PROSO II using a primary Parzen model and a logistic regression classifier for protein solubility prediction.
The motivation of this study arises mainly from the following aspects: 1) the features of amino acid and dipeptide composition are useful for solubility prediction, but there are very few studies on estimating propensities of individual residues and dipeptides to be soluble; 2) it is also desirable to know the relationship between protein solubility and some biochemical and physicochemical properties of amino acid residues; 3) the existing SVM-based classifiers with a set of selected features have high generalization ability and prediction accuracy, but they suffer from low interpretability of insight to solubility; and 4) a simple and easily interpretable prediction method with an acceptable accuracy is more useful.
In this study, we propose a novel scoring card method (SCM) that uses dipeptide composition only to estimate solubility scores for predicting protein solubility from sequences. SCM first estimates the propensities of 400 individual dipeptides to be soluble using statistical discrimination between soluble and insoluble proteins, and then optimizes them with an intelligent genetic algorithm by maximizing prediction accuracy. The solubility score of a protein is simply determined as a weighted sum of all propensity scores and dipeptide composition. To evaluate SCM by performance comparisons with existing SVM-based classifiers, four data sets with different sizes and degrees of variation in experimental conditions were used.
By analyzing the relationship between the 531 physicochemical properties in the AAindex and the estimated solubility scores of residues, we can gain some insights into protein solubility. For example, the properties of the hydrophobicity group have a wide range of correlation coefficients (R values) in [-0.31, 0.35]. This agrees with the inconsistency of literature reports about the propensity of hydrophobicity due to different experimental conditions. The property with the largest value, R = 0.83, is the distribution (i.e., percentages) of amino acid residues in the α-helices in thermophilic proteins. This finding agrees with the high propensity of α-helix structure and thermophilic proteins to be soluble reported in the literature.
The performance comparison results show that the proposed SCM is effective for predicting protein solubility, compared with existing state-of-the-art SVM-based methods. The SCM method has the potential to generate score cards of dipeptides for predicting protein functions in which dipeptide and amino acid composition features play an important role, such as the prediction of carbohydrate-binding proteins. Potential applications of SCM include protein function prediction problems such as crystallization, subcellular localization and nuclear receptors, virulent proteins, protein structural class and ion channels [21, 22], and gene expression level.
In this study, we use four data sets with different sizes and degrees of variation in experimental conditions to evaluate the proposed method SCM. The first data set, Sd957, was established by the authors from proteins whose solubility status is confirmed by biological experiments. The other three data sets are the same as those used in existing studies ([11, 13] and the PROSO II study), for performance comparisons.
Data sets Sd957 and Sd726
Expressed proteins with known solubility status were collected from literature reports [11–13]; all were expressed at the normal growth temperature of 37°C. Only proteins from previous work whose solubility status is confirmed by biological experiments were considered for this data set. The dataset, called Sd957, consists of 285 soluble proteins and 672 insoluble proteins, collected mainly from three parts.
In the first part, a combination of the keywords inclusion bodies, soluble, E. coli, and overexpression was used to search PubMed to identify proteins that have been overexpressed in E. coli under the normal growth condition. The second part comes from the dataset of 212 proteins, including 52 soluble proteins and 160 inclusion bodies. The proteins in these two parts have no fusion tags. The third part comes from the dataset of 726 protein sequences used by Chan et al. (named Sd726), which is the combination of six different fusion tags and 121 target proteins. Different fusion tags combined with the same target protein may give expressed proteins with different solubility status. There are 980 proteins after integrating the three parts. After deleting duplicate proteins, 957 proteins remain in the final dataset. The dataset is available at http://iclab.life.nctu.edu.tw/SCM/.
Data set SOLproDB
The dataset SOLproDB, with 17,408 proteins (8704 soluble and 8704 insoluble), was presented by Magnan et al. and collected from major protein databases such as Protein Data Bank (PDB), SwissProt, and TargetDB, and from literature reports. Although that study assumes that SOLproDB comes from the same experimental condition, possibly ~20% of the protein sequences from TargetDB were expressed using different hosts. After removing protein sequences that contain unknown amino acid residues, this dataset comprises 16,902 proteins (8212 soluble and 8690 insoluble). The comparison results between SCM and SOLpro are obtained using this refined dataset.
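The filtering step described above can be illustrated with a short sketch. It assumes that an "unknown" residue is any character outside the 20 standard one-letter amino acid codes; the function names are hypothetical and are not taken from the SCM source code.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def has_only_standard_residues(sequence):
    """Return True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence.upper()) <= STANDARD_AMINO_ACIDS


def remove_unknown_residue_sequences(records):
    """Keep the (sequence, label) pairs whose sequence has no unknown residues."""
    return [(seq, label) for seq, label in records
            if has_only_standard_residues(seq)]
```

Under this assumption, sequences containing ambiguity codes such as X or B are dropped as a whole rather than repaired.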
Data set SdPROSOII
Smialowski et al. proposed a two-layered method PROSO II using a primary Parzen model and a logistic regression classifier for protein solubility prediction.
The data set SdPROSOII used in that study consists of 82,299 proteins at 90% sequence identity. SdPROSOII was established by selecting proteins from the pepcDB and PDB databases. For performance comparison, the sequence identity within the soluble and insoluble sets is separately reduced further to 25%, as in that study, using the CD-HIT program.
The proposed method SCM
The proposed scoring card method SCM is an efficient and general method for creating various kinds of dipeptide scoring cards for predicting protein functions from whole sequences. It is suited to prediction problems in which amino acid and dipeptide composition serve as significantly effective features. SCM is described as a general-purpose algorithm that uses no heuristics or specific domain knowledge, so it can be applied to other prediction problems without significant modification. The generic score matrix of dipeptides can also be further customized and combined with other complementary features to improve prediction accuracy.
Creation of data sets
A 10-fold cross-validation experiment is adopted to evaluate SCM for predicting protein solubility. For each specified data set, a scoring matrix of dipeptides is customized in the SCM method. The dataset Sd957, whose solubility status is confirmed by biological experiments, is used to illustrate SCM and to analyze the scoring matrix of dipeptides. The dataset Sd957 is randomly divided into 766 training (219 soluble, 547 insoluble) and 191 test (50 soluble, 141 insoluble) proteins. The training data set is used for optimizing the solubility scoring matrix (SSM) and for determining a suitable threshold value for classifying a query sequence as a soluble or insoluble protein.
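A minimal sketch of how a training set could be partitioned for 10-fold cross-validation is given below. The shuffling scheme, the fixed seed, and the function names are assumptions made for illustration; the text does not state whether the folds were stratified by class, so this sketch does not stratify.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test part."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the 766 training proteins of Sd957, this gives ten folds of 76 or 77 proteins each.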
Initial scoring matrix using a statistical approach
The solubility scoring matrix (SSM) of dipeptides, consisting of 400 dipeptide scores, is generated using a coarse-to-fine approach. The initial SSM is created by a statistical approach based on the dipeptide composition, and the final SSM is then optimized by an intelligent genetic algorithm (IGA). The initial SSM is obtained using the following algorithm. The input is the two classes of soluble and insoluble sequences. The output is an initial SSM of dipeptides. The larger the solubility score of a dipeptide, the larger its contribution to the propensity of a protein to be soluble.
Step 1: Calculate the numbers of 400 dipeptides in each class. For example, the numbers of dipeptide AA in soluble and insoluble classes are 1067 and 1833, respectively.
Step 2: Normalize the dipeptide composition by dividing the numbers using the total numbers of dipeptides in each class. For example, the total numbers of dipeptides in soluble and insoluble classes are 97,147 and 217,263, respectively. Therefore, the compositions of AA are 0.01098 and 0.0084, respectively.
Step 3: The scores of SSM for an individual dipeptide are obtained by subtracting the score of the insoluble class from that of the soluble class. For example, the score of AA is 0.00258 (= 0.01098 - 0.0084).
Step 4: Normalize the scores of all dipeptides into the range [0, 1000]. The score of AA is 794.
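Steps 1 to 4 can be sketched as follows. This is an illustrative reading of the procedure, not the released SCM code: it assumes that dipeptides are counted as overlapping pairs along each sequence and that Step 4 is a linear min-max rescaling onto [0, 1000]. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def dipeptide_composition(sequences):
    """Steps 1-2: count overlapping dipeptides in one class, then divide by the class total."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
    # Only the 400 standard dipeptides contribute to the total.
    total = sum(counts[d] for d in DIPEPTIDES)
    return {d: (counts[d] / total if total else 0.0) for d in DIPEPTIDES}


def initial_ssm(soluble, insoluble):
    """Steps 3-4: soluble minus insoluble composition, rescaled to [0, 1000]."""
    soluble_comp = dipeptide_composition(soluble)
    insoluble_comp = dipeptide_composition(insoluble)
    diff = {d: soluble_comp[d] - insoluble_comp[d] for d in DIPEPTIDES}
    lowest, highest = min(diff.values()), max(diff.values())
    span = highest - lowest
    # Assumed min-max scaling; a flat matrix is mapped to the midpoint 500.
    return {d: (1000.0 * ((diff[d] - lowest) / span) if span else 500.0)
            for d in DIPEPTIDES}
```

With the counts quoted in Steps 1 and 2, the compositions of AA are 1067/97,147 ≈ 0.01098 and 1833/217,263 ≈ 0.0084, which reproduces the difference of 0.00258 used in Step 3.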
The scores of dipeptides in the SSM, presented here for the first time in the literature, are highly correlated with the relative contribution of dipeptides to protein solubility prediction using SCM. To further quantify the relative contribution of each amino acid to protein solubility, we average the scores of the dipeptides AX and XA, where X can be any amino acid, and assign the averaged score to the amino acid A. The SSM of amino acids can therefore be derived. If the amino acid composition (i.e., percentages) of a certain protein has a high correlation with the SSM of amino acids, this protein tends to be predicted as a soluble protein.
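The derivation of amino acid scores from dipeptide scores can be sketched as below. One detail is an assumption: the homodipeptide (for example AA) appears in both the AX and the XA group, and this sketch counts it twice so that each amino acid is averaged over 40 entries. The function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_scores(ssm):
    """Average the scores of the dipeptides aX and Xa for every amino acid a."""
    scores = {}
    for a in AMINO_ACIDS:
        values = [ssm[a + x] for x in AMINO_ACIDS]   # dipeptides starting with a
        values += [ssm[x + a] for x in AMINO_ACIDS]  # dipeptides ending with a
        scores[a] = sum(values) / len(values)        # 40 entries, aa counted twice
    return scores
```

The input `ssm` is expected to be a mapping from each of the 400 dipeptides to its score.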
Optimized solubility scoring matrix
The initial SSM is further optimized by using IGA, an efficient evolutionary algorithm for solving large parameter optimization problems. In this problem, 400 real-valued variables encoding the dipeptide scores are to be optimized. To apply IGA to a parameter optimization problem, both the fitness function (also called the objective function in optimization) and the chromosome representation in which the parameters are encoded need to be specified. The fitness function and chromosome representation are described first, followed by the IGA algorithm of SCM.
Fitness function and chromosome representation
The fitness function of SCM comprises two parts that concern both consistency and accuracy. To increase consistency, a Pearson's correlation coefficient (R value) between the optimized SSM and the initial one of amino acids should be maximized. This criterion is derived from the hypothesis that the initial SSM of amino acids has meaningful information and should be conserved provided that the training data set is sufficiently large with nearly the same experimental conditions.
The two parts are combined in a weighted multi-objective fitness function, where W1 and W2 are user-defined weights. In this work, W1 and W2 are set to 0.9 and 0.1, respectively, after evaluating other weight combinations (see Results section).
All the 400 real-valued variables are encoded into a chromosome of IGA, where each variable belongs to the range [0, 1000]. To obtain an SSM with high generalization ability on independent tests, a 10-fold cross-validation assessment is used in evaluating the fitness function. IGA uses a divide-and-conquer strategy to solve large-scale optimization problems. Details of the method are given in [15, 26, 27].
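A hedged sketch of such a weighted fitness is given below. The exact formula is not reproduced in this text, so two points are assumptions: that the accuracy term is a value in [0, 1] (for example, a cross-validation accuracy), and that W1 weights the accuracy term while W2 weights the R value. The latter is consistent with the tested pair (1.0, 0), which would then correspond to optimizing accuracy alone. The function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant list; treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def fitness(accuracy_term, initial_aa_scores, optimized_aa_scores, w1=0.9, w2=0.1):
    """Assumed form: w1 * accuracy + w2 * R(initial, optimized amino acid scores)."""
    r = pearson_r(initial_aa_scores, optimized_aa_scores)
    return w1 * accuracy_term + w2 * r
```

Here the two score lists are the 20 amino acid scores derived from the initial and the optimized SSM, in the same residue order.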
IGA algorithm of SCM
The initial SSM is obtained from the statistical method based on the training data set mentioned above. The IGA algorithm of SCM for obtaining an optimized SSM is described as follows:
Step 1: (Initialization) Randomly generate Npop individuals including the initial SSM. In this study, Npop = 40.
Step 2: (Evaluation) Compute fitness values of all individuals where Ibest is the best individual in the population.
Step 3: (Selection) Use a rank-based selection to select Ps·Npop individuals to establish a mating pool. In this study, Ps = 1.0.
Step 4: (Crossover) Perform the intelligent crossover operation for each individual with Ibest, and keep the best two individuals among the two parents and two children as the new children (the elitist strategy).
Step 5: (Mutation) Use a real-valued mutation operator to randomly mutate individuals with a mutation probability Pm (= 0.01). Mutation is not applied to Ibest to prevent the best fitness value from deteriorating.
Step 6: (Termination test) If a given termination condition is satisfied, stop this algorithm. Otherwise, go to Step 2. In this study, 20 generations are used as the stop condition.
Besides IGA, other efficient optimization algorithms can also be used to optimize the SSM.
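As an illustration of the overall loop, the sketch below implements a plain elitist genetic algorithm with uniform crossover. It is deliberately not IGA: the intelligent crossover based on orthogonal experimental design is replaced by a simple uniform crossover with the best individual, and rank-based selection with Ps = 1.0 is approximated by letting every individual mate. The function name, the `seed` parameter, and the user-supplied `fitness` callable are hypothetical.

```python
import random


def optimize_scores(initial_scores, fitness, n_pop=40, generations=20,
                    p_mut=0.01, seed=0):
    """Plain elitist GA over score vectors in [0, 1000] (a stand-in for IGA)."""
    rng = random.Random(seed)
    n_genes = len(initial_scores)
    # Step 1: the population holds the initial SSM plus random individuals.
    population = [list(initial_scores)]
    population += [[rng.uniform(0.0, 1000.0) for _ in range(n_genes)]
                   for _ in range(n_pop - 1)]
    for _ in range(generations):  # Step 6: fixed number of generations
        # Step 2: evaluate and locate the best individual.
        best = max(population, key=fitness)
        offspring = []
        for parent in population:  # Step 3 with Ps = 1.0: everyone mates
            if parent is best:
                continue
            # Step 4 stand-in: uniform crossover with the best individual.
            mask = [rng.random() < 0.5 for _ in range(n_genes)]
            child_a = [b if m else p for m, b, p in zip(mask, best, parent)]
            child_b = [p if m else b for m, b, p in zip(mask, best, parent)]
            survivor = max((parent, child_a, child_b), key=fitness)
            # Step 5: mutate each gene with probability p_mut (never the best).
            offspring.append([rng.uniform(0.0, 1000.0) if rng.random() < p_mut else g
                              for g in survivor])
        population = [best] + offspring
    return max(population, key=fitness)
```

Because the best individual is carried over unchanged in every generation, the returned fitness is never worse than that of the initial SSM. In practice `fitness` would wrap the cross-validated evaluation of a candidate SSM, which makes each generation considerably more expensive than in a toy setting.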
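Prediction of protein solubility
The solubility score S(P) of a query protein P is computed as the weighted sum of the dipeptide propensity scores in the SSM, with the dipeptide composition of P as the weights. If S(P) is greater than the given threshold value, P is classified as a soluble protein; otherwise, P is insoluble.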
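The scoring and classification rule can be sketched as follows, taking the solubility score as the weighted sum of dipeptide scores with the dipeptide composition of the sequence as the weights. The threshold search is an assumption made for illustration (it simply maximizes training accuracy over the observed scores), and the function names are hypothetical.

```python
from collections import Counter


def solubility_score(sequence, ssm):
    """Weighted sum of dipeptide scores, weighted by the dipeptide composition."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    if not pairs:
        return 0.0
    counts = Counter(pairs)
    # Dipeptides missing from the matrix contribute 0 (an assumption of this sketch).
    return sum(count / len(pairs) * ssm.get(d, 0.0) for d, count in counts.items())


def predict_soluble(sequence, ssm, threshold):
    """Classify as soluble when the score is strictly greater than the threshold."""
    return solubility_score(sequence, ssm) > threshold


def choose_threshold(scores, labels):
    """Pick the candidate threshold with the highest training accuracy."""
    candidates = sorted(set(scores))
    candidates = [candidates[0] - 1.0] + candidates  # allows "all soluble"

    def accuracy(t):
        return sum((s > t) == y for s, y in zip(scores, labels)) / len(scores)

    return max(candidates, key=accuracy)
```

Since the composition weights sum to one, the score of a sequence always lies within the range of the SSM entries, so thresholds are directly comparable to the [0, 1000] scale of the dipeptide scores.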
Results and discussion
Performance of solubility prediction using SCM
Effects of weights W1 and W2
The weight pairs (W1, W2) tested are (0.8, 0.2), (0.9, 0.1), and (1.0, 0), and SCM is evaluated using two representative data sets: Sd957 with similar experimental conditions and SOLproDB with diverse experimental conditions. The data set SOLproDB was also randomly divided into two data sets consisting of 8451 training (4106 soluble, 4345 insoluble) and 8451 test (4106 soluble, 4345 insoluble) sequences. To deal with the well-known non-determinism of genetic algorithms (GAs), i.e., the outcomes of GAs are not always the same because of the use of random numbers, SCM was run 10 times independently on the training data sets of Sd957 and SOLproDB, and the SSM with the highest accuracy was selected as the final optimized SSM.
Table: The performance of SCM with different pairs of weights (W1, W2) = (0.8, 0.2), (0.9, 0.1), and (1.0, 0) on two data sets Sd957 and SOLproDB.
Performance evaluation of SCM
10 independent runs of the scoring card method on Sd957.
Performance comparisons between SCM and SVM using the same dipeptide composition.
Comparing SCM with existing methods
Data set Sd726
To compare with the existing method implemented on a data set with similar experimental conditions, we applied the same SCM method to the data set Sd726. The performance assessment was carried out in a similar way to that of the SVM-based method. The data set was randomly divided into training and test data sets 10 times. Each of the training data sets was independently used for optimizing an SSM. The results of SCM on Sd726 for the 10 independent runs are given in Table S2 [see Additional file 2]. In this experiment, SCM achieves a mean test accuracy of 83.48%. By comparison, the SVM-based method, using 617 features consisting of 84-nucleotide composition, 71 post-translational modifications, 400 dipeptides, and 62 nucleotide and protein features used in previous works, achieved a test accuracy of 83.51%. The results reveal that the SCM method using the optimized SSM with dipeptide composition features only is comparable to the SVM-based method using a number of feature types.
Data sets SOLproDB and SdPROSOII
The cross-performance comparison between SCM and SOLpro.
SCM is also compared with the newly published method PROSO II using the new data set SdPROSOII at 25% sequence identity. The reported 10-fold cross-validation accuracy of PROSO II is 69.9%, whereas SCM has a mean accuracy of 64.36%. Because the holdout data set of that study is not available, we performed an independent test experiment by dividing SdPROSOII into two equal subsets for training and test. The training and test accuracies are 66.55% and 64.50%, respectively. Notably, PROSO II has a two-layered structure in which the outputs of a primary Parzen window model and a logistic regression classifier serve as inputs to a second-level logistic regression classifier. These results reveal the advantages of SCM in simplicity, interpretability, and accuracy, compared to the much more complex method.
Propensity analysis of dipeptides and amino acids
The initial solubility scoring matrix of amino acids.
The turn forming residues Asn, Gly, Pro and Ser have a high propensity to be insoluble and favor inclusion body formation [7, 8]. The amino acids Pro (P), Asn (N), Ser (S) and Gly (G) have scores ranked 9, 15, 17 and 18, respectively, which agrees with this propensity to be insoluble. It has been reported that insoluble proteins contain fewer negatively charged amino acids (Glu and Asp); that is, insoluble proteins more frequently have fewer negatively charged residues. Consistently, the amino acids Glu and Asp have scores ranked 2 and 3, respectively (Table 5).
The 10 top-ranked dipeptides are LA, IP, MC, QE, GT, DH, LT, PE, GF and CA, with scores 1000, 997, 991, 989, 988, 987, 983, 977, 977, and 972, respectively. The dipeptides with high scores play an important role in increasing solubility. For example, the dipeptide GF (Gly-Phe) has been used as a model dipeptide in a study of the degradation kinetics and oil solubility of ester prodrugs. The 20 amino acid values of the property KUMS000103, which are the percentages of amino acid residues in the α-helices in thermophilic proteins, are given in Table 5. The amino acid residues Leu (9.1%) and Ala (14.1%) are the two top-ranked ones with the highest percentages, and the dipeptide LA has the highest score of 1000. This agrees with the high propensity of α-helix structure and thermophilic proteins to be soluble. Notably, the dipeptide with the smallest score (0) is SS, where S is a turn forming residue favoring inclusion body formation [7, 8].
Propensity analysis of physicochemical properties
Selected R values between some interesting physicochemical properties and the optimized SSM of amino acids.
KUMS000103 Distribution of amino acid residues in the α-helices in thermophilic proteins
PRAM900102 Relative frequency in α-helix
MUNV940102 Free energy in α-helical region
HOPT810101 Hydrophilicity value
KUHL950101 Hydrophilicity scale
LEVM760101 Hydrophobic parameter
CIDH920103 Normalized hydrophobicity scales for α+β-proteins
CASG920101 Hydrophobicity scale from native protein structures
OOBM850102 Optimized propensity to form reverse turn
PALJ810116 Normalized frequency of turn in α/β class
ROBB760108 Information measure for turn
FAUJ880112 Negative charge
FAUJ880111 Positive charge
CHAM830108 A parameter of charge transfer donor capability
KUMS000103 Distribution of amino acid residues in the α-helices in thermophilic proteins
FUKS010109 Entire chain composition of amino acids in intracellular proteins of thermophiles (percent)
FUKS010105 Interior composition of amino acids in intracellular proteins of thermophiles (percent)
Table 6 gives the correlation values between the optimized SSM of amino acids and some interesting properties. From Table 6, the two groups of α-helix and thermophilic properties have large mean values of R = 0.37 and 0.52, respectively. The properties of the hydrophobicity group have a mean value of R = -0.09 and a wide range of R in [-0.31, 0.35]. This agrees with the inconsistency of literature reports about the propensity of hydrophobicity due to different experimental conditions [9, 32].
The aliphatic amino acids (Ala, Ile, Leu, Pro and Val) have been found to occur in much higher proportions than other amino acids in thermophilic bacteria, so they can be regarded as indicators of protein thermostability. It has been suggested that an increase in the thermostability of proteins might favor an increase in their solubility, because solubility on overexpression and thermostability are positively correlated. All five aliphatic amino acids are among the top ten residues according to their scores (Table 5). The analysis results reveal that the SSMs of amino acids and dipeptides are informative and can be used to investigate solubility and its change upon point mutation.
Distribution of top-ranked dipeptides on sequences
This study has proposed a novel scoring card method (SCM) to estimate solubility scores of dipeptides and amino acid residues from a large dataset of sequences for predicting solubility of proteins and analyzing the propensity of physicochemical properties. The solubility scoring matrices (SSMs) of dipeptides and amino acids are easily manipulated. The classification method is very simple and the prediction result is easily interpretable. The SCM with SSMs performs well in predicting solubility, compared with existing complex methods using a large number of complementary features which correlate well with solubility. Furthermore, the propensity of physicochemical properties and the relative contribution to protein solubility are also analyzed by using the correlation value R. The results agreeing with the literature reports reveal that the SSMs are effective.
Since solubility is influenced by various factors such as pH, temperature, buffer concentration, and various additives, the obtained SSM of dipeptides is only a generic matrix. If a customized SSM is needed, datasets of protein solubility for specific expression conditions can be added and the generic SSM can be tuned by using SCM. Since the proposed SCM method is effective for generating SSMs to predict protein solubility, future work will apply SCM to generate various kinds of scoring matrices of dipeptides for investigating protein function prediction problems in which dipeptide and amino acid composition features play an important role.
The authors would like to thank the National Science Council of Taiwan for financially supporting this research under the contract numbers 100-2627-B-009-004- and 100-2221-E-009-143-, and "Center for Bioinformatics Research of Aiming for the Top University Program" of the National Chiao Tung University and Ministry of Education, Taiwan, R.O.C for supporting projects 101W962 and 100W9700. This work was also supported in part by the UST-UCSD International Center of Excellence in Advanced Bioengineering sponsored by the Taiwan National Science Council I-RiCE Program under Grant Number: NSC-101-2911-I-009-101.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
1. Pedelacq JD, Piltch E, Liong EC, Berendzen J, Kim CY, Rho BS, Park MS, Terwilliger TC, Waldo GS: Engineering soluble proteins for structural genomics. Nat Biotechnol. 2002, 20 (9): 927-932. 10.1038/nbt732.
2. Trevino SR, Scholtz JM, Pace CN: Amino acid contribution to protein solubility: Asp, Glu, and Ser contribute more favorably than the other hydrophilic amino acids in RNase Sa. J Mol Biol. 2007, 366 (2): 449-460. 10.1016/j.jmb.2006.10.026.
3. Idicula-Thomas S, Kulkarni AJ, Kulkarni BD, Jayaraman VK, Balaji PV: A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli. Bioinformatics. 2006, 22 (3): 278-284. 10.1093/bioinformatics/bti810.
4. Dale GE, Broger C, Langen H, D'Arcy A, Stuber D: Improving protein solubility through rationally designed amino acid replacements: solubilization of the trimethoprim-resistant type S1 dihydrofolate reductase. Protein Eng. 1994, 7 (7): 933-939. 10.1093/protein/7.7.933.
5. Jenkins TM, Hickman AB, Dyda F, Ghirlando R, Davies DR, Craigie R: Catalytic domain of human immunodeficiency virus type 1 integrase: identification of a soluble mutant by systematic replacement of hydrophobic residues. Proc Natl Acad Sci USA. 1995, 92 (13): 6057-6061. 10.1073/pnas.92.13.6057.
6. Murby M, Samuelsson E, Nguyen TN, Mignard L, Power U, Binz H, Uhlen M, Stahl S: Hydrophobicity engineering to increase solubility and stability of a recombinant protein from respiratory syncytial virus. Eur J Biochem. 1995, 230 (1): 38-44. 10.1111/j.1432-1033.1995.tb20531.x.
7. Wilkinson DL, Harrison RG: Predicting the solubility of recombinant proteins in Escherichia coli. Biotechnology (N Y). 1991, 9 (5): 443-448. 10.1038/nbt0591-443.
8. Davis GD, Elisee C, Newham DM, Harrison RG: New fusion protein systems designed to give soluble expression in Escherichia coli. Biotechnol Bioeng. 1999, 65 (4): 382-388. 10.1002/(SICI)1097-0290(19991120)65:4<382::AID-BIT2>3.0.CO;2-I.
9. Idicula-Thomas S, Balaji PV: Understanding the relationship between the primary structure of proteins and its propensity to be soluble on overexpression in Escherichia coli. Protein Sci. 2005, 14 (3): 582-592. 10.1110/ps.041009005.
10. Smialowski P, Martin-Galiano AJ, Mikolajka A, Girschick T, Holak TA, Frishman D: Protein solubility: sequence based prediction and experimental verification. Bioinformatics. 2007, 23 (19): 2536-2542. 10.1093/bioinformatics/btl623.
11. Magnan CN, Randall A, Baldi P: SOLpro: accurate sequence-based prediction of protein solubility. Bioinformatics. 2009, 25 (17): 2200-2207. 10.1093/bioinformatics/btp386.
12. Diaz AA, Tomba E, Lennarson R, Richard R, Bagajewicz MJ, Harrison RG: Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnol Bioeng. 2010, 105 (2): 374-383. 10.1002/bit.22537.
13. Chan WC, Liang PH, Shih YP, Yang UC, Lin WC, Hsu CN: Learning to predict expression efficacy of vectors in recombinant protein production. BMC Bioinformatics. 2010, 11 (Suppl 1): S21. 10.1186/1471-2105-11-S1-S21.
14. Smialowski P, Doose G, Torkler P, Kaufmann S, Frishman D: PROSO II - a new method for protein solubility prediction. FEBS J. 2012, 279 (12): 2192-2200. 10.1111/j.1742-4658.2012.08603.x.
15. Ho SY, Shu LS, Chen JH: Intelligent evolutionary algorithms for large parameter optimization problems. IEEE Transactions on Evolutionary Computation. 2004, 8 (6): 522-541. 10.1109/TEVC.2004.835176.
16. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36 (Database): D202-205.
17. Lee H-C, Liou Y-F, Charoenkwan P, Ho S-J, Shu L-S, Ho S-Y, Huang H-L: Prediction of carbohydrate-binding proteins using a scoring card method. The 6th International Conference on Bioinformatics and Biomedical Engineering (iCBBE 2012). 2012.
18. Kurgan L, Razib AA, Aghakhani S, Dick S, Mizianty M, Jahandideh S: CRYSTALP2: sequence-based protein crystallization propensity prediction. BMC Structural Biology. 2009, 9 (50).
19. Bhasin M, Raghava GP: Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem. 2004, 279 (22): 23262-23266. 10.1074/jbc.M401932200.
20. Muley SB, Bastikar V, Bothe S, Meshram A, Roy N: Virulence prediction model (virprob) using amino acid and dipeptide composition for human pathogens. Journal of Biophysics and Structural Biology. 2011, 3 (1): 24-29.
21. Chen K, Kurgan LA, Ruan J: Prediction of protein structural class using novel evolutionary collocation-based sequence representation. J Comput Chem. 2008, 29 (10): 1596-1604. 10.1002/jcc.20918.
- Lin H, Ding H: Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. J Theor Biol. 2011, 269 (1): 64-69. 10.1016/j.jtbi.2010.10.019.View ArticlePubMedGoogle Scholar
- Raghava GP, Han JH: Correlation and prediction of gene expression level from amino acid and dipeptide composition of its protein. BMC Bioinformatics. 2005, 6: 59-10.1186/1471-2105-6-59.PubMed CentralView ArticlePubMedGoogle Scholar
- Huang Y, Niu B, Gao Y, Fu L, Li W: CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010, 26 (5): 680-682. 10.1093/bioinformatics/btq003.PubMed CentralView ArticlePubMedGoogle Scholar
- Bradley AP: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997, 30 (7): 1145-1159. 10.1016/S0031-3203(96)00142-2.View ArticleGoogle Scholar
- Ho SY, Chen JH, Huang MH: Inheritable genetic algorithm for biobjective 0/1 combinatorial optimization problems and its applications. IEEE Trans Syst Man Cybern B Cybern. 2004, 34 (1): 609-620. 10.1109/TSMCB.2003.817090.View ArticlePubMedGoogle Scholar
- Tung CW, Ho SY: POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties. Bioinformatics. 2007, 23 (8): 942-949. 10.1093/bioinformatics/btm061.View ArticlePubMedGoogle Scholar
- Chang C-CaL, Chih-Jen : LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2 (3): 27:21--27:27.View ArticleGoogle Scholar
- Christendat D, Yee A, Dharamsi A, Kluger Y, Savchenko A, Cort JR, Booth V, Mackereth CD, Saridakis V, Ekiel I: Structural proteomics of an archaeon. Nat Struct Biol. 2000, 7 (10): 903-909. 10.1038/82823.View ArticlePubMedGoogle Scholar
- Larsen SW, Ankersen M, Larsen C: Kinetics of degradation and oil solubility of ester prodrugs of a model dipeptide (Gly-Phe). Eur J Pharm Sci. 2004, 22 (5): 399-408. 10.1016/j.ejps.2004.04.013.View ArticlePubMedGoogle Scholar
- Costantini S, Colonna G, Facchiano AM: Amino acid propensities for secondary structures are influenced by the protein structural class. Biochem Biophys Res Commun. 2006, 342 (2): 441-451. 10.1016/j.bbrc.2006.01.159.View ArticlePubMedGoogle Scholar
- Trevino SR, Scholtz JM, Pace CN: Amino acid contribution to protein solubility. J Mol Biol. 2009Google Scholar
- Ikai A: Thermostability and aliphatic index of globular proteins. J Biochem. 1980, 88 (6): 1895-1898.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.