Learning to predict expression efficacy of vectors in recombinant protein production
© Chan et al. 2010
Published: 18 January 2010
Recombinant protein production is a useful biotechnology to produce a large quantity of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial-and-error is still the common practice to find out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression.
In this study, we applied machine learning to train models that predict whether a vector-protein pairing will express or not express in E. coli. For expressed cases, the models further predict whether the expressed proteins will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on support vector machines (SVMs) and their ensembles. Unlike many previous works, these models use the sequence of the target protein as well as the sequence of the whole fusion vector as features. We show that a model that classifies a case directly into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that takes advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Our best method also achieves higher prediction accuracy than previous works. Lastly, we briefly present two methods that use the trained model in the design of recombinant protein production systems to improve the chance of producing highly soluble protein.
In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.
Acquiring large quantities of a desired protein in situ from original host cells is not trivial. Moreover, gene over-expression and purification of the corresponding proteins in a soluble form are important for structural and functional proteomics. Recombinant protein production is an important procedure in biotechnology and one of the few ways to over-express a given protein coding sequence of interest. To date, Escherichia coli (E. coli), a Gram-negative bacterium, remains an accessible and favored host for cloning and expressing a given protein on many occasions [1–3]. In recent years, a variety of studies have developed well-established large-scale and high-throughput systems to obtain large quantities of soluble recombinant proteins [4–6]. However, it has been reported that a large number of foreign heterologous proteins are expressed at relatively low levels or are difficult to solubilize in E. coli. Over-expressed proteins in an insoluble form are termed inclusion bodies. Since refolding inclusion bodies to recover a soluble form of the recombinant protein is time-consuming, the expression of insoluble protein aggregates is frequently a major obstacle in recombinant protein production.
Hence, to prevent inclusion body formation in recombinant expression systems, many researchers have dedicated their efforts to optimizing growth conditions, such as buffer composition, protein concentration, and cultivation temperature. Others have focused on improving folding probabilities through enhancement of mRNA stability, over-expression of rare-codon tRNA, selection of efficient vectors and host strains, and co-expression with solubility-enhancing proteins. Still more studies have emphasized the importance of increasing the solubility of recombinant proteins in E. coli by fusing them to highly soluble carrier proteins [9–12]. Because the only way to find a match between a target protein and a fusion partner that leads to a soluble form is still trial and error, a more systematic approach is required.
Without regard to the fusion vector used, most previous works have attempted to predict the propensity of a given protein to be soluble or not in E. coli. The first such study was conducted by Wilkinson and Harrison with a regression model analysis. They concluded that five amino acid-dependent factors are discriminative features that correlate with inclusion body formation. These were charge average approximation (Asp, Glu, Lys and Arg), turn-forming residue fraction (Asn, Gly, Pro and Ser), cysteine and proline fractions, hydrophilicity, and molecular weight. In a subsequent study, Davis et al. improved the statistical model of solubility in E. coli by demonstrating that the first two parameters were more critical than the other three. Additionally, in the context of structural genomics projects, Bertone et al. applied machine learning techniques such as decision trees and Support Vector Machines (SVMs) to discover other informative features based on 562 proteins from Methanobacterium thermoautotrophicum. Among these critical parameters, a low content of negatively charged residues (DE <17%) and the presence of hydrophobic patches are associated with insoluble protein formation. Subsequently, Goh et al. applied random forest and decision-tree based methods to about 27,000 protein targets in TargetDB and concluded that the most significant protein feature was serine percentage composition. Furthermore, Luan et al. collected 1,536 soluble proteins out of 10,167 ORFs in Caenorhabditis elegans expressed with a single vector and one E. coli strain. In their study, the most prominent protein feature was GRAVY (Grand Average of Hydropathicity, an indicator of the average hydrophobicity of a protein).
To date, many studies have shown that Support Vector Machines (SVMs) combined with appropriate kernels frequently perform better at biological sequence classification than other methods based on statistical learning theory [19, 20]. Recently, many studies have applied SVMs to the problem of assessing the propensity of target proteins to be soluble or to form inclusion bodies in E. coli. Building on sequence-dependent features they had previously observed at the protein level, Idicula-Thomas et al. proposed an SVM-based approach that achieved 72% prediction accuracy [21, 22]. Additionally, Smialowski et al. developed PROSO, a two-layered predictor combining SVM and Naive Bayes classifiers, and obtained performance comparable to that of Idicula-Thomas et al.
The studies mentioned above rest on at least two basic assumptions: 1) a given gene is over-expressed, and 2) the expression level of a target gene is the same regardless of which vector it is fused to in E. coli. Consequently, most previous works focused only on identifying factors related to solubility prediction and pooled inclusion-body and non-expression cases into a single negative set. However, recent research has reported that recombinant proteins expressed as inclusion bodies retain more biological activity than previously appreciated. Thus, it remains important to distinguish inclusion bodies from the rest of the negative set used in previous studies. Moreover, because these works predicted the solubility of a target gene from its protein sequence alone, they implicitly assumed that a given protein yields the same expression result regardless of the fusion vector. In practice, a given protein usually yields different expression levels when fused to different vectors. Note also that different groups have identified different crucial protein factors from the soluble target proteins obtained in their own experiments. This is likely due in part to differences in expression conditions across experiments, such as the choice of fusion vector, which is the focus of this work. Indeed, Kataeva et al. reported that the sensitivity of previous works in predicting the solubility of recombinant proteins was much lower than the specificity.
Data distribution of three expression levels in six fusion vectors. 726 cases comprised 121 genes fused into six fusion vectors generated from HTP systems.
As shown in Figure 1, the three expression levels of recombinant proteins observed in SDS-PAGE experiments were soluble fraction, inclusion fraction, and non-expression. Hence, after screening the solubility of 121 target proteins in six different fusion vectors, the modelling of expression and solubility of entire recombinant proteins, including the given genes and fusion vectors, was formulated as a three-class classification problem. In this work, three SVM-based methods are proposed that tackle this three-class classification problem while making different use of the hierarchical structure of the classes. Furthermore, based on our experimental data on over-expression of given genes in different fusion vectors in E. coli, we considered the entire sequences of the recombinant expressed proteins instead of only the sequences of the target proteins, as in previous works. Altogether, these three SVM-based methods, i.e. flatSVM, nestSVM, and hierSVM, assign each case, consisting of a given gene and its corresponding fusion vector, to one of the expression levels observed in SDS-PAGE experiments. Because the F1 measure is frequently used in multi-class classification problems, it was employed to compare the performance of the three proposed SVM-based methods. The Precision-Recall Curve (PRC) and the Receiver Operating Characteristic (ROC) were used to compare the performance of our methods with previous works. The following sections describe feature extraction and our three proposed methods in more detail.
Before training the SVMs, feature extraction was applied to generate fixed-length feature vectors from the entire recombinant expression regions. Two major steps of recombinant protein production in E. coli were considered: transcribing the messenger RNAs (mRNAs) of the cloning and expression regions, and translating the proteins of the recombinant fusion vectors. The major factors considered were mRNA expression and stability, codon usage in E. coli, solubility of whole fusion vectors, and Post-Translational Modifications (PTMs) on recombinant proteins. Hence, at both the nucleotide and protein levels, the entire cloning and expression regions were used to derive potential features for predicting the transcription efficacy and solubility propensity of recombinant fusion proteins in E. coli. For mRNA expression and stability, 84 k-mer features (k = 1, 2, and 3) were derived, along with the transcribed mRNA length, for each recombinant fusion gene.
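The nucleotide-level k-mer features described above can be illustrated with a minimal sketch. It assumes that each feature is the relative frequency of an overlapping k-mer over the DNA alphabet, which the text does not specify; the function name kmer_features is hypothetical and is not taken from the authors' program.

```python
from itertools import product

ALPHABET = "ACGT"  # DNA alphabet; an mRNA sequence would use U instead of T


def kmer_features(seq, ks=(1, 2, 3)):
    """Return overlapping k-mer frequencies for each k, followed by the length.

    Assumption: features are relative frequencies (counts divided by the
    number of valid windows); the paper does not state the normalization.
    """
    seq = seq.upper()
    features = []
    for k in ks:
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        for i in range(max(len(seq) - k + 1, 0)):
            window = seq[i:i + k]
            if window in counts:  # skip windows with ambiguous bases such as N
                counts[window] += 1
        total = sum(counts.values())
        features.extend(counts[m] / total if total else 0.0 for m in kmers)
    features.append(float(len(seq)))  # sequence length as the final feature
    return features


vec = kmer_features("ATGGCTAGCAAAGGT")
print(len(vec))  # 85: 4 + 16 + 64 k-mer frequencies plus the length
```

The count of 84 follows directly from the alphabet size: 4 + 16 + 64 possible k-mers for k = 1, 2, and 3.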
Table: Feature index used in this study. Entries include nt Seq Length; Codon Adaptation Index; features from Wilkinson and Harrison (1991); features from Idicula-Thomas et al. (2006); features from Plewczynski et al. (2005).
Support Vector Machines (SVMs) are a machine learning technique for classification and regression, originally developed by Vapnik on the basis of statistical learning theory. SVMs search for the optimal separating hyperplane by maximizing the margin between classes. Combined with kernel functions, SVMs map data that are not linearly separable in the input space into a high-dimensional feature space. In this paper, expression level prediction of recombinant fusion proteins was formulated as a three-class classification problem; i.e., soluble, insoluble, and non-expression. After feature generation, all features in the instance vectors were normalized to zero mean and a Standard Deviation (SD) of 1. Three SVM-based methods are proposed here to deal with the three-class classification problem. Depending on how the hierarchical structure formed by the expression levels of recombinant fusion proteins is considered, instance vectors were treated as flat, nested, or hierarchical. The three SVM-based methods were named flatSVM, nestSVM, and hierSVM, respectively. LIBSVM was used to implement all core algorithms in this research. Given the characteristics of the features, the radial basis function (RBF) kernel implemented in LIBSVM was used because it handles most cases of numerical data well. In the present work, all instance vectors were divided into m parts by stratified sampling, so that each random partition preserves the same proportion of each of the three classes. For training and validation, k-fold Cross-Validation (CV) was applied to m-1 parts of the instance vectors, and the remaining part was used as the test set. This procedure of training and testing was repeated n times. Finally, the performance results of the n repeats were averaged and their corresponding SDs were computed.
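As a rough illustration of the normalization and stratified partitioning just described, the following sketch uses only the Python standard library. The helper names zscore_columns and stratified_parts are hypothetical, the class sizes in the example are made up, and the round-robin dealing is one simple way to keep class proportions similar across parts, not necessarily the procedure the authors used.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale each feature column to zero mean and unit SD (constant columns become 0)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mus, sds)]
            for row in rows]


def stratified_parts(labels, m, seed=0):
    """Split instance indices into m parts with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    parts = [[] for _ in range(m)]
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            parts[pos % m].append(idx)  # deal each class round-robin over the parts
    return parts


# Made-up class sizes, for illustration only.
labels = ["soluble"] * 20 + ["insoluble"] * 20 + ["non-expression"] * 30
parts = stratified_parts(labels, m=10)
print([len(p) for p in parts])  # ten parts of 7 instances each
```

One part would then be held out as the test set while cross-validation runs on the remaining m-1 parts. In practice the scaling statistics are usually estimated on the training parts only and then applied to the held-out part.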
For flatSVM, we ignored the hierarchical structure shown in Figure 1 and treated the binary-tree taxonomy as a flat one. Generally, one-against-one (1vs1) and one-against-the-rest (1vsAll) are the two commonly used strategies for multi-class classification problems. However, as reported by Hsu et al., the 1vsAll strategy may achieve performance comparable to the 1vs1 strategy, but it takes much more time to train. Therefore, considering the cost of training time, we decided to use the 1vs1 strategy instead of the 1vsAll strategy in this work. As mentioned in Feature Generation, instance vectors with 617 features were derived from the entire recombinant fusion regions, and the 726 instance vectors were stratified by label and randomly divided into ten parts. Along with their corresponding expression-level labels, 652 instance vectors were used for training and validation with 10-fold CV. Using three 1vs1 classifiers, the predicted class of an instance vector was determined by majority voting. The remaining 74 instance vectors, unseen during training and validation, were used to evaluate the performance of the trained classifiers. The same procedure of training and testing was repeated ten times, and the average and SD of the ten results were calculated. All programs were implemented with the LIBSVM package.
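The majority vote over the three 1vs1 classifiers can be sketched as follows. The pairwise classifiers here are toy threshold rules standing in for trained SVMs, and the tie-breaking rule (first class in a fixed order) is an assumption made for this sketch rather than something stated in the text.

```python
from collections import Counter
from itertools import combinations

CLASSES = ("soluble", "insoluble", "non-expression")


def one_vs_one_vote(x, pairwise, classes=CLASSES):
    """Predict a class by majority vote over all pairwise binary classifiers.

    pairwise maps a class pair (a, b) to a function returning a or b for x.
    Ties are broken by the order of `classes` (an assumption of this sketch).
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    best = max(votes.values())
    return next(c for c in classes if votes[c] == best)


# Toy stand-ins for trained SVMs: x = (expression score, solubility score).
toy = {
    ("soluble", "insoluble"):
        lambda x: "soluble" if x[1] > 0 else "insoluble",
    ("soluble", "non-expression"):
        lambda x: "soluble" if x[0] > 0 else "non-expression",
    ("insoluble", "non-expression"):
        lambda x: "insoluble" if x[0] > 0 else "non-expression",
}

print(one_vs_one_vote((1.0, -0.5), toy))  # insoluble
```

With three classes there are exactly three pairwise classifiers, so a three-way tie (one vote each) is possible and needs an explicit rule.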
Following the process of transcribing and translating a recombinant fusion protein in E. coli, the hierarchical structure was divided into two steps. The first step concerns protein expression and the second concerns the solubility of expressed proteins. In the first step, mRNA expression and stability of the recombinant fusion gene and codon preference in E. coli were the major factors. In the second step, solubility-related features were applied to test whether an expressed protein could fold correctly into a soluble form in E. coli. Based on this divide-and-conquer idea, a stepwise method, nestSVM, was proposed to tackle the three-class classification problem by training a separate classifier for each step. In this way, two binary classifiers were trained with distinct sets of features to predict whether a recombinant fusion gene could be expressed and whether an expressed recombinant fusion protein would be soluble in E. coli. For protein expression, a binary classifier was trained to determine whether a recombinant fusion gene could be expressed as a protein in E. coli, using features derived from nucleic acid sequences. The 84 k-mer frequency features, length, GC-content, and CAI derived from the entire recombinant fusion genes were used in the first binary classifier. As with flatSVM, 652 instance vectors, here with 87 features, were used with 10-fold CV for training and validation of the first classifier. However, for the first classifier, instance vectors labelled as soluble and insoluble were treated as one class. For predicting the solubility of expressed recombinant fusion proteins in E. coli, the second classifier was trained on the remaining, non-overlapping 530 protein-level features. In the ideal case, all soluble and insoluble instance vectors used in the first step were passed on to train the second classifier. Hence, 207 soluble cases and 212 insoluble cases were used with 10-fold CV for training and validation.
The second binary classifier was used to distinguish soluble from insoluble proteins. To test the performance of nestSVM, the first binary classifier was applied to the 74 unseen instance vectors to predict protein expression. The second binary classifier was then applied to the instances predicted as expressed in the first step to predict their solubility. The three-class classification problem was thus tackled by combining two binary classifiers that predict expression and solubility of recombinant fusion proteins in E. coli, respectively.
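The two-step prediction of nestSVM can be summarized in a short sketch. The two classifiers are passed in as plain functions standing in for trained SVMs. The ordering of the 617 features (nucleotide-level features first) and the names split_features and nested_predict are assumptions made for illustration.

```python
N_NT = 87       # 84 k-mer frequencies + length + GC-content + CAI
N_TOTAL = 617   # the remaining 530 features are protein-level


def split_features(vec):
    """Split a full instance vector into nucleotide-level and protein-level parts.

    Assumption: nucleotide-level features come first in the vector.
    """
    if len(vec) != N_TOTAL:
        raise ValueError("expected %d features, got %d" % (N_TOTAL, len(vec)))
    return vec[:N_NT], vec[N_NT:]


def nested_predict(vec, expresses, is_soluble):
    """Two-step prediction: expression first, then solubility if expressed."""
    nt_part, protein_part = split_features(vec)
    if not expresses(nt_part):
        return "non-expression"
    return "soluble" if is_soluble(protein_part) else "insoluble"


# Toy stand-ins for the two trained binary classifiers.
example = [0.5] * N_NT + [-0.2] * (N_TOTAL - N_NT)
label = nested_predict(
    example,
    expresses=lambda nt: sum(nt) > 0,
    is_soluble=lambda pr: sum(pr) > 0,
)
print(label)  # insoluble
```

A consequence of this cascade is that an error in the first step cannot be corrected by the second: a case wrongly predicted as non-expression never reaches the solubility classifier.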
For our third method, class labels were treated as attribute vectors instead of arbitrary numbers, following the concept of hierarchical classification. Because of resource constraints, we did not implement the entire algorithm. Instead, we reduced it to a binary SVM classification so that it fits the public domain tool. The algorithm is as follows. According to Figure 1, the attribute vectors of the labels were encoded as <1, 0, 0, 1>, <0, 1, 0, 1>, and <0, 0, 1, 0> for soluble, insoluble, and non-expression, respectively. Here, the first three digits of an attribute vector correspond to the original labels in the order soluble, insoluble, and non-expression. The last digit represents the common parent node of the soluble and insoluble labels in the class taxonomy. In other words, for an instance labelled as non-expression, the last digit of the attribute vector is zero.
To reduce the hierarchical SVM classification to a binary classification, differences between pairs of attribute vectors were taken. For example, for an instance vector labelled as soluble, two new attribute vectors for positive cases were produced by subtracting the attribute vectors of insoluble and non-expression, respectively, from the attribute vector of soluble. Hence, the two new attribute vectors for positive cases were <1, -1, 0, 0> and <1, 0, -1, 1>. Conversely, subtracting the attribute vector of soluble from the attribute vectors of insoluble and non-expression generated two new attribute vectors for negative cases: <-1, 1, 0, 0> and <-1, 0, 1, -1>. By using the tensor product ⊗ to combine the attribute vectors of the labels with the instance vectors, the 617 features of an instance vector were expanded fourfold, i.e., to 2468 features. For an instance vector X, the expansion generated four new vectors: <X, -X, 0, 0> and <X, 0, -X, X> were positive cases, and <-X, X, 0, 0> and <-X, 0, X, -X> were negative cases. Thus, for training and validation, two positive cases and two negative cases were used per instance. For testing, the 6 pairwise differences between the attribute vectors of the labels were applied, and the resulting predictions were averaged to decide the final predicted label. As with our other two methods, after stratified selection, 652 instance vectors were used for training and validation with 10-fold CV, and the remaining 74 instances were used as test cases.
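The expansion of an instance vector by differences of label attribute vectors can be written out in a few lines. This sketch covers only the construction of the expanded training cases described above; the helper names attr_diff, expand, and training_cases are hypothetical, and the test-time averaging step is not reproduced.

```python
# Attribute vectors: <soluble, insoluble, non-expression, expressed (parent node)>
LABEL_ATTR = {
    "soluble": (1, 0, 0, 1),
    "insoluble": (0, 1, 0, 1),
    "non-expression": (0, 0, 1, 0),
}


def attr_diff(a, b):
    """Attribute vector of label a minus attribute vector of label b."""
    return tuple(p - q for p, q in zip(LABEL_ATTR[a], LABEL_ATTR[b]))


def expand(x, diff):
    """Tensor (Kronecker) product of a 4-element attribute difference with x."""
    return [d * v for d in diff for v in x]


def training_cases(x, true_label):
    """Two positive and two negative expanded cases for one labelled instance."""
    cases = []
    for other in LABEL_ATTR:
        if other == true_label:
            continue
        cases.append((expand(x, attr_diff(true_label, other)), +1))
        cases.append((expand(x, attr_diff(other, true_label)), -1))
    return cases


print(attr_diff("soluble", "insoluble"))       # (1, -1, 0, 0)
print(attr_diff("soluble", "non-expression"))  # (1, 0, -1, 1)
print(len(expand([0.0] * 617, attr_diff("soluble", "insoluble"))))  # 2468
```

Each expanded vector is four blocks of length 617, where every block is X, -X, or all zeros depending on the corresponding entry of the attribute difference.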
The F1 measure is defined as

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. Alternatively, the well-known representation of the F score for a binary classification is expressed in terms of precision and recall, which are denoted as p and r, respectively:

$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2pr}{p + r}$$

Generally, a system with good performance will assign the correct class, and only the correct class, by maximizing both precision and recall, which in turn maximizes the F score.
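For a three-class problem, the F1 measure can be computed per class by treating that class as positive and the other two as negative. The following minimal sketch does this from lists of true and predicted labels; the function name f1_per_class is hypothetical, and how the per-class values are combined into a single score (for example, by averaging) is not stated in the text and is left open here.

```python
def f1_per_class(y_true, y_pred, classes):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), treating each class as positive."""
    scores = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0  # undefined case reported as 0
    return scores


y_true = ["soluble", "soluble", "insoluble", "non-expression"]
y_pred = ["soluble", "insoluble", "insoluble", "non-expression"]
scores = f1_per_class(y_true, y_pred, ["soluble", "insoluble", "non-expression"])
print(scores)
```

In this toy example one soluble case is misclassified as insoluble, which lowers the F1 of both classes: it is a false negative for soluble and a false positive for insoluble.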
Yule's Q-statistic for a pair of classifiers is defined as

$$Q = \frac{N_{11} N_{00} - N_{01} N_{10}}{N_{11} N_{00} + N_{01} N_{10}}$$

where $N_{ij}$ is the number of test instances classified correctly (i = 1) or incorrectly (i = 0) by the first classifier, and correctly (j = 1) or incorrectly (j = 0) by the second classifier. The value of Q ranges from -1 to 1. For statistically independent classifiers, Q is expected to be zero. A positive value of Q indicates that the classifiers tend to identify the same instances correctly, whereas classifiers that commit errors on different instances have a negative Q.
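The Q-statistic for a pair of classifiers can be computed directly from their predictions on the same test instances. This is a minimal sketch using the standard definition Q = (N11 N00 - N01 N10) / (N11 N00 + N01 N10); the function name yules_q is hypothetical.

```python
def yules_q(y_true, pred_a, pred_b):
    """Yule's Q for two classifiers evaluated on the same test instances."""
    n11 = n10 = n01 = n00 = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == t), (b == t)
        if a_ok and b_ok:
            n11 += 1      # both correct
        elif a_ok:
            n10 += 1      # only the first classifier correct
        elif b_ok:
            n01 += 1      # only the second classifier correct
        else:
            n00 += 1      # both wrong
    den = n11 * n00 + n01 * n10
    if den == 0:
        raise ValueError("Q is undefined when both products are zero")
    return (n11 * n00 - n01 * n10) / den


truth = [1, 1, 0, 0]
same_errors = yules_q(truth, [1, 0, 0, 1], [1, 0, 0, 1])
different_errors = yules_q(truth, [1, 1, 0, 1], [1, 0, 0, 0])
print(same_errors, different_errors)  # 1.0 -1.0
```

Two classifiers that make exactly the same mistakes reach Q = 1, while classifiers whose mistakes never coincide reach Q = -1, which is the situation where combining them in an ensemble is most likely to help.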
Performance evaluation of six individual classifiers with respect to six vectors.
F1 measure
0.3509 ± 0.0372
0.4323 ± 0.1515
0.4016 ± 0.0874
0.3746 ± 0.1017
0.3297 ± 0.0695
0.2854 ± 0.0609
Performance evaluation of three proposed methods
F1 measure
0.7791 ± 0.0606
0.6989 ± 0.0578
0.7466 ± 0.0464
0.7241 ± 0.0287
0.7551 ± 0.0719
0.7068 ± 0.0600
0.7000 ± 0.0498
0.7075 ± 0.0442
0.7833 ± 0.0998
0.7875 ± 0.0747
0.7083 ± 0.0900
0.7000 ± 0.0852
0.7397 ± 0.1027
0.6466 ± 0.0795
0.7015 ± 0.0718
0.7240 ± 0.0567
0.8351 ± 0.0488
0.7865 ± 0.0528
0.8041 ± 0.0320
0.8135 ± 0.0253
Student's t-test in accuracy.
t-test in accuracy (p-value)
We used the feature selection package provided with LIBSVM to measure the importance of the features. We found that removing the less important features from our feature set resulted in lower accuracy. For instance, after feature selection that kept only the 37 most important features, the testing accuracy decreased dramatically from 87.84% to 45.95% in one of the best cross-validation models of flatSVM. Hence, we decided to keep all 617 features to maintain the performance.
Yule's Q-statistic between proposed methods.
From a biotechnology perspective, to improve the solubility of target proteins, some laboratory workers mutate a few nucleotides in target genes without changing the corresponding translated proteins. Alternatively, following the codon preference of E. coli, they synthesize whole nucleotide sequences based on the amino acid sequences. Here, we designed two computational simulations that use our prediction classifier to enhance the solubility of recombinant proteins. In the first simulation, to reflect the number of mutation sites that is practical in biological laboratories, we limited the number of mutation steps to at most five. To reduce the search space effectively, we employed a beam search that, at each step, narrows the search down to the top five candidates most likely to be soluble, as predicted by our classifier. In the second simulation, we used the preferred codons of E. coli to synthesize entire nucleotide sequences of target proteins. By feeding these synthesized nucleotide sequences into our classifiers, we investigated whether any change in the predicted expression level would occur.
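The first simulation can be sketched as a beam search over synonymous single-codon substitutions. Everything specific in this sketch is an assumption for illustration: the codon table is deliberately partial (four amino acids only), the scoring function gc_fraction is a placeholder for the classifier's predicted propensity to be soluble, and the names beam_search and neighbours are hypothetical.

```python
# Partial codon table for illustration only (Ala, Lys, Glu, Gly).
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "GAA": "E", "GAG": "E",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}


def codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq):
    return "".join(CODON_TO_AA[c] for c in codons(seq))


def neighbours(seq):
    """Yield every sequence that differs from seq by one synonymous codon swap."""
    cs = codons(seq)
    for i, c in enumerate(cs):
        for alt, aa in CODON_TO_AA.items():
            if alt != c and aa == CODON_TO_AA[c]:
                yield "".join(cs[:i] + [alt] + cs[i + 1:])


def beam_search(seq, score, max_steps=5, beam_width=5):
    """Keep the beam_width best-scoring candidates at each of max_steps steps."""
    best, beam, seen = seq, [seq], {seq}
    for _ in range(max_steps):
        candidates = {n for s in beam for n in neighbours(s)} - seen
        if not candidates:
            break
        seen |= candidates
        beam = sorted(candidates, key=lambda s: (-score(s), s))[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best


def gc_fraction(seq):
    """Placeholder score; a trained classifier's output would be used instead."""
    return (seq.count("G") + seq.count("C")) / len(seq)


start = "GCTAAAGAAGGT"
best = beam_search(start, gc_fraction)
print(translate(start) == translate(best), gc_fraction(best))
```

Because only synonymous swaps are generated, every candidate encodes the same protein as the starting sequence, and the step limit bounds how far a candidate can drift from the original gene.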
Since Wilkinson and Harrison first applied statistical and computational approaches to solubility prediction for a given protein over-expressed in E. coli, a large number of researchers have developed a variety of methods for identifying the important factors that affect the solubility of recombinant proteins. In this study, we developed three SVM-based methods that predict the three expression levels observed in SDS-PAGE experiments for a target protein in a corresponding vector in E. coli. Unlike most previous works, which omitted the cases of no protein expression in E. coli, this work is the first attempt to tackle the three-class classification problem of distinguishing the expression level of a desired protein in SDS-PAGE experiments. Moreover, our experimental data show that a given protein can result in different expression levels when over-expressed in different vectors in E. coli. Therefore, this work considers the entire cloning and expression regions rather than the target protein alone. The prediction results of our classifiers could help biologists choose effectively and efficiently among different vectors to obtain soluble recombinant proteins in E. coli. Additionally, from a biotechnology perspective, by mutating a few nucleotides or by synthesizing optimal sequences according to the codon preference of E. coli, our prediction methods may also provide ways to enhance the solubility of target proteins.
The authors wish to thank Chun-Houh Chen, Huai-Kuang Tsai, and Yu-Shi Lin for their valuable insights into this study.
Funding: This work is supported in part by the National Research Program in Genomic Medicine (NRPGM), NSC, Taiwan, under Grant No. NSC98-3112-B-001-026 (Advanced Bioinformatics Core) and NSC96-3112-B-001-013 (Core Facility of Recombinant Protein Production), and in part under Grant No. NSC97-3112-B-010-020 (Advanced Bioinformatics Core:Functional Bioinformatics) and 97-3112-B-010-022 (Advanced Bioinformatics Core).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.