Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme

Abstract

Background

Bioluminescent proteins (BLPs) exist widely in many living organisms. Because BLPs can emit light, they can serve as biomarkers that are easily detected in biomedical research, for example in gene expression analysis and the study of signal transduction pathways. Accurate identification of BLPs is therefore important for disease diagnosis and biomedical engineering. In this paper, we propose a novel, accurate sequence-based method named PredBLP (Prediction of BioLuminescent Proteins) to predict BLPs.

Results

We collect a series of sequence-derived features that have been shown to be involved in the structure and function of BLPs. These features include amino acid composition, dipeptide composition, sequence motifs and physicochemical properties. We further show that the combination of all four types of features outperforms any other combination and any individual feature type. To remove potentially irrelevant or redundant features, we use the Fisher-Markov Selector together with a Sequential Backward Selection strategy to select the optimal feature subsets. Additionally, we design a lineage-specific scheme, which proves more effective than the traditional universal approach.

Conclusion

Experiments on benchmark datasets demonstrate the robustness of PredBLP. We show that lineage-specific models significantly outperform universal ones. We also test the generalization capability of PredBLP on independent testing datasets as well as on newly deposited BLPs in UniProt. PredBLP outperforms many state-of-the-art methods. A web server named PredBLP, which implements the proposed method, is freely available for academic use.

Background

Bioluminescence is a special form of chemiluminescence that is common in many living organisms across the lineages of bacteria, eukaryota and archaea [1]. Bioluminescent proteins (BLPs), which emit light by converting chemical energy into light energy, play a critical role in bioluminescence [2, 3]. Employed as highly sensitive labels, they are enormously useful in non-invasive in vivo biomedical research, such as gene expression analyses [4] and studies of signal transduction pathways [5]. Since BLPs can be easily detected, they are widely used in bioluminescence imaging (tagging biological entities or processes), as biosensors for environmental contaminants, and as detectors to map neuronal circuits [6]. In particular, BLPs can be used for non-invasive analyses of molecular functions in living cells and organisms. With the help of bioluminescence microscopy, scientists can trace and monitor a chemical reaction by quantifying the photon emission of BLPs (such as luciferase) [7]. The quantified visible light provides clues about the location and status of BLPs implanted into tumors or tissues.

Bioluminescence imaging and biosensors are characterized by their capability of providing highly sensitive identification of BLPs. However, these methods suffer from several potential problems that affect detection performance. First, BLPs are sensitive to the microenvironment [8]. For instance, D-luciferin exhibits its peak spectrum in the green region in acidic solution but in the red region at basic pH [9]. Second, living organisms scatter or absorb most regions of the spectrum. Although low temperature can reduce thermal noise, it might also kill the tissues as well as the BLPs. Third, it is difficult to detect light produced inside a living animal without harming its skin. Fourth, light emission is the most significant factor, yet most bioluminescence signals are too weak to detect. Additionally, filtering the excitation light might affect the corresponding emission light [10]. As a result, biophysical and biochemical experiments can benefit from computational methods, which are able to process large amounts of data accurately and efficiently.

Recent years have witnessed a number of computational methods for predicting BLPs. The earliest computational study on recognizing BLPs dates back to 2011, when Kandaswamy et al. used 544 physicochemical properties and a support vector machine to predict BLPs [11]. They also built the first sequence-based predictor, named BLProt. Soon after that, Zhao et al. proposed an improved method named BLPre, which uses evolutionary profiles represented by position-specific scoring matrices to construct the feature vector [12]. Fan et al. adopted the concept of pseudo amino acid composition to represent proteins and achieved good prediction quality [13]. Huang et al. introduced a knowledge acquisition method for characterizing BLPs and an evolutionary fuzzy classifier to build the prediction model [14]. They also proposed a scoring card method to estimate the propensity scores of dipeptides and amino acids and to design prediction models [15]. Nath et al. adopted an oversampling technique and the unsupervised K-means algorithm for predicting BLPs [16].

In summary, these methods provide important clues in this field. Some of them provide web servers or programs. These prediction tools help biologists to quickly predict potential BLPs and promote the development of this field. However, as far as we know, two aspects need further investigation. First, most of these studies used various types of features to encode the proteins (the samples) but lacked detailed analyses or descriptions of those features. That is, the discrimination capability of these features remains uncertain. Second, most of these studies only considered general BLPs. In other words, they did not consider the differences across different lineages of BLPs. Based on our research, these differences are worth deeper investigation and are expected to further improve the accuracy of prediction models. However, they have not yet received enough attention.

Motivated by these two drawbacks, in this study we focus on the challenge of building a novel, accurate predictor for identifying BLPs based on sequence-derived features. We collect and compile four new datasets (one general and three lineage-specific), which contain non-redundant BLPs and non-BLPs. Next, a series of sequence-derived features that have been shown to be involved in BLPs are computed to encode the proteins. Detailed analyses are performed to show empirically the differences between BLPs and non-BLPs, especially across lineage-specific BLPs. These differences are then used to discriminate BLPs from non-BLPs. For the convenience of biology researchers, our method has been implemented as a user-friendly web server named PredBLP (Prediction of BioLuminescent Proteins), which is freely available at http://www.inforstation.com/PredBLP/.

Methods

Datasets

In this work, we construct four datasets, one general and three lineage-specific, for the investigation of BLPs. The three lineages considered are bacteria, eukaryota and archaea. All datasets are compiled from 17,403 BLPs collected from UniProt (Jul. 2016) [17]. Since the presence of homologous sequences would bias the modeling and prediction processes, we use BLASTClust [18] to cluster these proteins with a cut-off of 30%. We choose BLASTClust because it is capable of clustering sequences with low similarity as well as long sequences. Next, we randomly pick one protein from each cluster as the representative. This yields 863 BLPs (positive samples), of which 748 belong to bacteria, 70 to eukaryota and 45 to archaea. We also collect 7093 non-redundant non-BLPs as negative samples, of which 4919, 1426 and 748 are affiliated with bacteria, eukaryota and archaea, respectively. In each dataset, we randomly pick 80% of the positives and an equal number of negatives for balanced training. The remaining samples are used for independent testing. Detailed information on these newly compiled datasets can be found on our web server.
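The balanced sampling step can be sketched as follows. This is a minimal illustration rather than the PredBLP code: the function name `balanced_split`, the fixed seed and the rounding of the 80% cut are assumptions, and integer ranges stand in for protein identifiers.

```python
import random

def balanced_split(positives, negatives, train_fraction=0.8, seed=0):
    """Pick train_fraction of the positives and an equal number of negatives
    for training; every sample not picked goes to the independent test set."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_train = int(round(train_fraction * len(pos)))  # rounding is an assumption
    if n_train > len(neg):
        raise ValueError("not enough negatives for a balanced training set")
    train = [(p, 1) for p in pos[:n_train]] + [(n, 0) for n in neg[:n_train]]
    test = [(p, 1) for p in pos[n_train:]] + [(n, 0) for n in neg[n_train:]]
    return train, test

# Sizes of the archaea dataset: 45 BLPs and 748 non-BLPs (IDs are placeholders).
train, test = balanced_split(range(45), range(1000, 1748))
```

With these sizes the training set holds 36 positives and 36 negatives, and the remaining 9 positives and 712 negatives form the (imbalanced) independent test set.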

To compare our proposed method fairly with previous studies [11, 12, 13, 15, 16, 19], we also use Kandaswamy's [11] training dataset. The BLPs in Kandaswamy's training dataset were selected from the Pfam database [20], and CD-HIT [21] was then used to remove redundant proteins with more than 40% sequence similarity.

The construction of feature vector

The features of amino acid composition

Amino acids are the fundamental building blocks of proteins, and amino acid composition (AAC) provides useful clues about protein structure and function. AAC features are widely used in bioinformatics [22, 23, 24]. In this work, the AAC features for each type of BLPs, including the general BLPs and the three lineages of BLPs, are calculated by:

$$ f_{AAC}(i)=\frac{AA_i}{L} $$
(1)

where AA_i (i ∈ {1, 2, 3, …, 20}) represents the number of occurrences of the i-th type of amino acid in the query protein, and L indicates the length of the query protein. In this way, we quantify the composition of the 20 amino acids in the query protein.
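As a minimal sketch of Eq. (1), and not the PredBLP implementation, the AAC vector can be computed as below. The alphabetical ordering of the 20 residues and the function name `aac_features` are assumptions; L is taken as the full sequence length, so non-standard residue codes lower the sum of the fractions slightly.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, assumed ordering

def aac_features(sequence):
    """Return the 20 amino acid fractions AA_i / L of Eq. (1)."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("empty sequence")
    return [seq.count(aa) / length for aa in AMINO_ACIDS]

# Toy peptide: 2 of the 5 residues are alanine.
features = aac_features("MKAAL")
```

For the toy peptide, the alanine entry is 2/5 and the 20 values sum to 1.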

The features of dipeptide composition

Previous studies have shown that dipeptide composition (DC) plays important roles in protein structure and function, such as in vivo activity and protein thermostability [25]. The DC features can be formulated as:

$$ {f}_{DC}\left( i, j\right)=\frac{\sum_{n=1}^{L-1}{AA}_n{AA}_{n+1}\to {AA}_i{AA}_j}{L-1} $$
(2)

where AA_i AA_j (i, j ∈ {1, 2, 3, …, 20}) represents one of the 400 types of dipeptide, and n indicates the position of the n-th residue in the query protein of length L. The expression AA_n AA_{n+1} → AA_i AA_j denotes that the dipeptide AA_n AA_{n+1} in the query protein is the same as the dipeptide AA_i AA_j, so the sum counts the occurrences of that dipeptide. f_DC(i, j) thus quantifies the frequency of each dipeptide using a straightforward statistical approach.
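Eq. (2) can be illustrated in the same way. The sketch below slides a window of two residues over the sequence and divides each count by L − 1; the names `DIPEPTIDES` and `dc_features` and the residue ordering are assumptions, and pairs containing non-standard residues are simply not counted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400

def dc_features(sequence):
    """Return the 400 dipeptide frequencies of Eq. (2)."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for n in range(len(seq) - 1):      # overlapping windows AA_n AA_{n+1}
        pair = seq[n:n + 2]
        if pair in counts:
            counts[pair] += 1
    total = len(seq) - 1
    return [counts[d] / total for d in DIPEPTIDES]

# "AARAA" contains the windows AA, AR, RA, AA.
dc = dc_features("AARAA")
```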

The features of sequence motifs

Sequence motifs (MTF) in protein sequences often indicate conserved regions [26]. Although many similarities between proteins in the same family may disappear after long-standing evolution, some inherited attributes still exist because they are functionally or structurally related signals [27]. These signals help to control the cellular localization regions and the corresponding biochemical functions [28]. Thus, in this study, we use information theory to compute the MTF features that favor BLPs over non-BLPs. We first calculate the original information entropy of BLPs and non-BLPs. Then, we iteratively generate an l-length pattern P from "AXA" to "VXV" ("X" denotes arbitrary amino acid(s)). For each pattern P, we calculate its occurrence frequencies in BLPs and non-BLPs. If its frequency in BLPs is larger than the minimal preset occurrence frequency threshold T (in this study, T = 10%), we use the pattern P to reclassify the samples and calculate the updated information entropy. We then compare the original information entropy with the updated one to obtain the corresponding information gains of the considered P. Next, we calculate the difference of these two information gains (DIG). The higher the difference, the more discriminatory the pattern. The pseudo-code of this procedure is shown in Fig. 1.

Fig. 1 The pseudo-code of the calculation of motifs
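The entropy step of this procedure can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Fig. 1: it computes the standard information gain obtained by splitting the labelled samples on the presence of a pattern, treats "X" as exactly one arbitrary residue (written as `.` in a regular expression), and does not attempt the exact DIG definition. The function names and toy sequences are hypothetical.

```python
import math
import re

def entropy(n_pos, n_neg):
    """Shannon entropy (in bits) of a two-class sample."""
    total = n_pos + n_neg
    h = 0.0
    for n in (n_pos, n_neg):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h

def information_gain(pattern, blps, non_blps):
    """Entropy before minus weighted entropy after splitting on the pattern."""
    total = len(blps) + len(non_blps)
    if total == 0:
        raise ValueError("no samples")
    regex = re.compile(pattern)
    pos_hit = sum(1 for s in blps if regex.search(s))
    neg_hit = sum(1 for s in non_blps if regex.search(s))
    pos_miss, neg_miss = len(blps) - pos_hit, len(non_blps) - neg_hit
    before = entropy(len(blps), len(non_blps))
    after = ((pos_hit + neg_hit) / total) * entropy(pos_hit, neg_hit) \
        + ((pos_miss + neg_miss) / total) * entropy(pos_miss, neg_miss)
    return before - after

# Toy data: "A.A" occurs in both positives and in no negative.
blps = ["MAKAL", "GAGAV"]
non_blps = ["MKKLL", "GGGVV"]
gain_good = information_gain("A.A", blps, non_blps)
gain_poor = information_gain("G", blps, non_blps)
```

In the toy data, the pattern "A.A" separates the two classes perfectly and receives a gain of 1 bit, whereas "G" occurs equally often in both classes and receives a gain of 0.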

In this work, we choose the top 10 motifs sorted in descending order of DIG values. Next, we create a 10-dimensional binary vector to denote whether or not a query protein contains each of these 10 motifs. A value of 1 indicates that the motif is present, and 0 that it is absent.
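The binary encoding can be sketched as below. The three patterns are hypothetical placeholders, not the motifs of Table 1, and "X" is again assumed to match exactly one residue.

```python
import re

def motif_vector(sequence, motifs):
    """Return 1 for each motif found in the sequence and 0 otherwise."""
    return [1 if re.search(m, sequence) else 0 for m in motifs]

# Hypothetical patterns for illustration only; PredBLP uses its top 10 motifs.
example_motifs = ["A.A", "R.L", "G.G"]
vec = motif_vector("MAKALRRL", example_motifs)
```

Here the sequence contains "AKA" and "RRL" but no "G.G" match, so the vector is [1, 1, 0].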

The features of physicochemical properties

Since amino acids serve as the building blocks of proteins, the physicochemical properties (PCP) of amino acids influence the microscopic environment, which includes surface motions, energy, and dynamics [23, 27, 29]. In this part, we further investigate several properties related to BLPs.

Alipour et al. found that the insertion and substitution of positively charged residues affect the light shift mechanism [30]. Li et al. showed that the hydrophobicity of the active site determines the activity of BLPs [31]. Moradi et al. pointed out that a change in the polarity of the emitter site of BLPs leads to modulation of the bioluminescence color [32, 33]. In particular, the movement of a flexible loop in BLPs usually changes the polarity of the emitter site at the same time [32]. For instance, if a bulge appears in a flexible loop, the emitted light shifts from green to red. With the help of energy acceptors, energy transfer changes the bioluminescence intensity and also affects the spectral shifts [34, 35]. Silva et al. stated that an increase in polarity causes a decrease in the emission energies. They also provided evidence that changes of solvent and pH affect the structural and electronic properties of BLPs [36]. Considering this, we collect nine physicochemical properties: hydrophobicity [31], hydrophilicity [37], polarity [38], polarizability [39], transfer free energy [40], solvent contact area [41], positive charge [42], flexibility [43] and protein kinase A [44]. Given a query protein, its PCP features are calculated as follows:

$$ {f}_{PCP}(i)=\frac{PCP_i- \min \left(\frac{1}{L}\sum_{j=1}^L{PCP}_{i, j}\right)}{ \max \left(\frac{1}{L}\sum_{j=1}^L{PCP}_{i, j}\right)- \min \left(\frac{1}{L}\sum_{j=1}^L{PCP}_{i, j}\right)} $$
(3)

where i represents the i-th PCP and j indicates the j-th amino acid in the query protein of length L. Detailed information on these properties is provided in Additional file 1: Table A1.
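One possible reading of Eq. (3) is sketched below: the per-residue values of a property are averaged over the protein, and the averages are then min-max scaled. Two points are assumptions rather than statements of the paper: the minimum and maximum are taken over the proteins of a dataset, and the Kyte-Doolittle hydropathy scale is used as a stand-in because the exact scales are listed only in Additional file 1: Table A1. Residues missing from the scale are skipped.

```python
# Kyte-Doolittle hydropathy values, used here only as an illustrative scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}

def mean_property(sequence, scale):
    """Average a per-residue property over the sequence: (1/L) * sum_j PCP_{i,j}."""
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    if not values:
        raise ValueError("no scorable residues")
    return sum(values) / len(values)

def min_max_normalize(raw_values):
    """Scale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(raw_values), max(raw_values)
    if hi == lo:
        return [0.0 for _ in raw_values]
    return [(v - lo) / (hi - lo) for v in raw_values]

# Toy dataset: all-isoleucine, all-arginine, and an alternating peptide.
proteins = ["IIII", "RRRR", "IRIR"]
raw = [mean_property(p, KYTE_DOOLITTLE) for p in proteins]
norm = min_max_normalize(raw)
```

The most hydrophobic toy peptide maps to 1, the most hydrophilic to 0, and the alternating peptide to 0.5.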

Feature selection strategy

The combination of various types of features can provide more useful information for constructing a model [22]. However, irrelevant (noisy) or redundant features may degrade the prediction quality of the predictor. In view of this, we adopt the Fisher-Markov Selector [45] together with Sequential Backward Selection [46] to perform feature selection. The Fisher-Markov Selector is a typical filter method. It uses Markov random fields to achieve exact global optimization when calculating the correlation coefficients between features and labels. Its output is a list of features ranked according to the calculated coefficients. Next, the Sequential Backward Selection strategy iteratively removes the least relevant features. Once a feature is eliminated, it is never considered again. The iteration stops when eliminating further features no longer improves the results. At that point, the remaining features constitute the optimal feature subset.
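The backward elimination loop can be sketched as follows. This is a ranking-driven simplification, not the authors' code: the ranked list is assumed to come from a filter such as the Fisher-Markov Selector (not implemented here), and `evaluate` is a hypothetical callable that would normally return a cross-validated score for a feature subset.

```python
def sequential_backward_selection(ranked_features, evaluate):
    """Drop the lowest-ranked feature while doing so improves the score.

    ranked_features: feature names, most relevant first.
    evaluate: callable mapping a list of features to a score (higher is better).
    """
    selected = list(ranked_features)
    best_score = evaluate(selected)
    while len(selected) > 1:
        candidate = selected[:-1]      # remove the least relevant feature
        score = evaluate(candidate)
        if score <= best_score:        # no improvement: stop eliminating
            break
        selected, best_score = candidate, score
    return selected, best_score

# Toy scoring function: informative features add 1, noisy ones cost 0.1.
def toy_score(features):
    return sum(1.0 if f.startswith("f") else -0.1 for f in features)

subset, score = sequential_backward_selection(["f1", "f2", "n1", "n2"], toy_score)
```

In the toy example the two noisy features are removed one after the other, and the loop stops when removing an informative feature would lower the score.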

Model construction and performance evaluation

The support vector machine (SVM) [47] has been shown to be a powerful machine learning algorithm [48]. It is widely used to construct models for predicting protein structures and functions [12, 27, 49]. In this study, we use LIBSVM (version 3.20) [50] to train the model and perform the prediction. The radial basis function is used as the kernel function, and a grid search is adopted to find the optimal parameters of the SVM model. The flowchart of our proposed method is shown in Fig. 2.

Fig. 2 The flowchart of the proposed method
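For orientation, the radial basis function kernel and a candidate grid for the two SVM parameters can be sketched as below. The exponent ranges are assumptions in the style of the coarse exponential grids commonly used with LIBSVM; the ranges actually searched are not reported in this text. In practice each (C, gamma) pair would be scored by cross-validation and the best pair retained.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def parameter_grid(log2_c=range(-5, 16, 2), log2_gamma=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs on an exponential grid (assumed ranges)."""
    return [(2.0 ** c, 2.0 ** g) for c in log2_c for g in log2_gamma]

grid = parameter_grid()
same = rbf_kernel([0.2, 0.4], [0.2, 0.4], gamma=0.5)   # identical vectors
far = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0)    # squared distance 1
```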

We assess our method using two statistical procedures, namely k-fold cross-validation and the independent test. For k-fold cross-validation, the samples in the training dataset are divided into k equal subsets. In each iteration, k-1 subsets are used as training data to train the model and the remaining one is used as validation data to test the model. This procedure is repeated k times, and the final performance is measured by averaging the results of the k iterations. For the independent test, the samples in the testing dataset are independent of those in the training dataset. The model trained on the training dataset is used to predict the testing datasets. In this study, binary criteria, including accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC), are used to evaluate methods that output binary predictions.

$$ Accuracy=\frac{TP+ TN}{TP+ FP+ TN+ FN} $$
(4)
$$ Sensitivity=\frac{TP}{TP+ FN} $$
(5)
$$ Specificity=\frac{TN}{TN+ FP} $$
(6)
$$ MCC=\frac{TP\times TN- FP\times FN}{\sqrt{\left( TP+ FP\right)\left( TN+ FN\right)\left( TP+ FN\right)\left( TN+ FP\right)}} $$
(7)

where TP, TN, FP and FN indicate the true positives (BLPs correctly predicted as BLPs), true negatives (non-BLPs correctly predicted as non-BLPs), false positives (non-BLPs incorrectly predicted as BLPs) and false negatives (BLPs incorrectly predicted as non-BLPs), respectively. When prediction probabilities are available, we use a score-based metric to assess methods that produce predicted propensities. As in other studies, we report the AUC value, which stands for the area under the ROC (Receiver Operating Characteristic) curve.

Results and discussion

The characteristics of the extracted features

In this work, we construct the feature space based on multiple types of features including AAC, DC, MTF and PCP. Before putting them into operation, we examine their characteristics on BLPs and non-BLPs.

To investigate the amino acid preference of BLPs, we calculate the AAC features for BLPs and non-BLPs separately. Figure 3 illustrates the relative amino acid composition of BLPs against non-BLPs in the four datasets. In general, compared with non-BLPs, BLPs are enriched in charged residues. This observation is consistent in bacteria and archaea BLPs. Moreover, bacteria BLPs are enriched in buried amino acids and depleted in acidic amino acids; eukaryota BLPs are enriched in aliphatic and aromatic amino acids; and archaea BLPs are enriched in acyclic and cyclic amino acids and depleted in aliphatic amino acids. We also find that the relative differences between eukaryota BLPs and non-BLPs are lower than those for bacteria and archaea BLPs. Detailed values are provided in Additional file 1: Table A2. We empirically regard amino acid compositions with a relative difference higher than 0.25% as discriminatory.

Fig. 3 The relative amino acid composition of BLPs against non-BLPs on four datasets

Figure 4 illustrates the relative dipeptide composition of BLPs against that of non-BLPs in the four datasets. Red blocks indicate discriminatory enriched dipeptides, while green blocks indicate depleted ones. The deeper the color, the more significant the enrichment or depletion. In general, BLPs show a high preference for A-, R-, P- and G-related dipeptides, which is consistent with bacteria and archaea BLPs. For general BLPs, the 'A-A', 'A-R', 'R-A' and 'R-L' dipeptides are over-represented relative to the normal level. In eukaryota BLPs, I- and G-related dipeptides are more favored. Moreover, K-related dipeptides are under-represented on both the C-terminal and N-terminal sides in general BLPs and in the three lineages of BLPs. The motifs can also be used to further discriminate the various lineages.

Fig. 4 The relative dipeptide composition of BLPs against that of non-BLPs in four datasets. The x-axis indicates the amino acids which are cleaved on the C-terminal side, while the y-axis stands for the N-terminal side. Detailed data of their values is provided in Additional file 1: Tables A3-A6

Table 1 lists the top 10 selected motifs in descending order of DIG value, which quantifies the relative discriminative capability of each motif. The higher the DIG value, the larger the difference in the occurrence of the motif between BLPs and non-BLPs.

Table 1 Selected top 10 motifs according to the descending order of DIG values

Figure 5 shows median-based box plots for the nine physicochemical properties considered. We notice that the variability of these properties in BLPs is overall much lower than in non-BLPs. For instance, the values of polarity, positive charge and flexibility in general BLPs are less variable than those in non-BLPs. This observation is consistent across the three lineage-specific datasets. Additionally, eukaryota BLPs are more flexible than bacteria and archaea BLPs. O'Brien et al. pointed out that the broad dynamic range and stable signals of eukaryota BLPs are the reasons for the increased flexibility [51].

Fig. 5 Basal levels of selected physicochemical and biological properties in four datasets. Midline, box boundaries, and whiskers indicate median, quartiles, and 10th and 90th percentiles. The x-axis indicates the normalized values, and the y-axis stands for twelve properties. In this work, a physicochemical property is empirically regarded to be discriminatory provided that the overlap of two boxes is less than 80% of either box

The study on the direct feature combination

From the perspective of machine learning, a combination of various types of features usually performs better than individual features do. Therefore, we test the effectiveness of individual feature types as well as of different combinations of them. For this purpose, we adopt five-fold cross-validation on the training dataset. The final results are reported as the average and standard deviation over the five folds.
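The fold construction and the reporting of mean and standard deviation can be sketched as follows. The shuffling seed, the strided assignment of samples to folds and the use of the sample standard deviation are assumptions, and the five fold scores are made-up numbers used only to exercise the code.

```python
import random
import statistics

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most one
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def summarize(fold_scores):
    """Mean and sample standard deviation of the per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

splits = list(k_fold_indices(10, k=5))
mean_score, std_score = summarize([0.6, 0.7, 0.8, 0.7, 0.7])  # illustrative only
```

Each sample appears in exactly one validation fold, so every sample is used for validation once and for training k-1 times.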

As shown in Table 2, all four types of features give promising prediction results. The DC features produce the highest MCC (0.650 ± 0.029) and AUC (0.830 ± 0.016) among the four individual feature types. In general, combinations of two feature types predict more accurately than individual types, and combinations of three more accurately than combinations of two. The combination of all four types achieves the best performance, with MCC = 0.676 ± 0.010 and AUC = 0.850 ± 0.006. This experiment demonstrates the effectiveness of the proposed features and further indicates that combining different types of features produces promising results. Similar experiments on the other three training sets are provided in Additional file 1: Table A7.

Table 2 The experimental results of various individual and combinative features on the training set for general BLPs

The performance of feature selection scheme

Although combining different types of features can improve the prediction accuracy, it also adds some noisy data to the feature vector. We therefore select an optimal feature subset. As stated in the section "Feature selection strategy", we first use the Fisher-Markov Selector to calculate the correlation coefficients of the different features (Additional file 1: Figure A1). Next, we adopt the Sequential Backward Selection strategy to select the optimal classifier and the corresponding optimal feature subset. Finally, we obtain 199 features on the general BLPs dataset, and 174, 204 and 129 features on the three lineage-specific BLPs datasets, respectively (Table 3). Based on the optimal feature subset, the classifier on general BLPs achieves an MCC of 0.698 ± 0.018 and an AUC of 0.883 ± 0.007, which are 0.022 (or 3.3%) and 0.033 (or 3.9%) higher than those based on the complete feature set. The three lineage-specific models show similar increases in prediction accuracy.
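The reported gains follow directly from the mean values with and without feature selection (MCC 0.676 and AUC 0.850 for the complete four-type feature set); standard deviations are ignored in this check:

$$ \begin{aligned} \Delta \mathrm{MCC} &= 0.698-0.676=0.022, & \frac{0.022}{0.676} &\approx 3.3\% \\ \Delta \mathrm{AUC} &= 0.883-0.850=0.033, & \frac{0.033}{0.850} &\approx 3.9\% \end{aligned} $$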

Table 3 The performance of optimum feature subsets on four training sets using five-fold cross-validation

In the section "The characteristics of the extracted features", we characterize in detail the intrinsic differences across general BLPs, bacteria BLPs, eukaryota BLPs, archaea BLPs and non-BLPs. We then perform feature selection, and the resulting optimal feature subset (Additional file 1: Table A8) is used to train the model. To check whether these differences are preserved after feature selection, we further investigate the composition of the optimal feature subset.

Figure 6 shows the overlap between the discriminatory features and the selected useful features. Among the 12 discriminatory features within AAC, 6 (or 50%) are selected in the optimal feature subset. More importantly, these account for 85.7% (6/7) of the selected AAC features. The overlap is even higher for the DC, MTF and PCP features, for which we notice that all discriminatory features are selected during the feature selection procedure. This suggests that the calculated differences are valuable in distinguishing BLPs from non-BLPs. The Venn diagrams for the three lineage-specific optimal subsets are illustrated in Additional file 1: Figure A2 and show similar results. Mathematically, the differences present in the features help the classifier to discriminate the samples, which explains the high fraction of overlap. In the follow-up experiments, we use the optimal feature subset to train the universal model and the lineage-specific ones.

Fig. 6 Venn diagrams of the overlap (green zone) between the discriminatory (orange pie) and selected useful features (blue pie) in the optimal subset for each type of features. D indicates the discriminatory features and s stands for selected useful features

Comparison of lineage-specific scheme with traditional universal approach

As stated in the section "The characteristics of the extracted features", BLPs in different lineages differ in the features we consider. These differences can be used to further improve the prediction performance through a lineage-specific scheme. Considering this, we design three lineage-specific classifiers in addition to the traditional universal one. In this section, we evaluate the effectiveness of this scheme.

Table 4 compares the performance of the lineage-specific models (PredBLP-B, PredBLP-E and PredBLP-A) with that of the universal model (PredBLP-U) on three training datasets. The lineage-specific models improve the average AUC values by 0.048 ~ 0.136 (or 5.5% ~ 20.3%) compared with the universal ones. We also notice that this scheme performs best for eukaryota BLPs, which agrees with our investigation of the differences across lineages. To sum up, the experimental results demonstrate the effectiveness of the lineage-specific scheme.

Table 4 Comparison of lineage-specific models with traditional universal models on three training sets using five-fold cross-validation

Comparison with other methods on Kandaswamy’s training dataset

To test the robustness of our method and to compare it fairly with previous studies [11, 12, 15, 16, 19], we also use Kandaswamy's training dataset [11]. We compare our method with BLProt [11], BLPre [12], Fan's method [13], SCMBLP [15], BLKnn [19] and Nath's method [16]. The results of these methods on Kandaswamy's training dataset are taken directly from their reports. Since Kandaswamy's training dataset does not annotate the lineage of the BLPs, we use the traditional universal approach to build the prediction model (PredBLP-U). Since these methods under-sample Kandaswamy's dataset in different ways, the sampling process may introduce bias. Considering this, we repeat the under-sampling procedure 10 times and report the corresponding average results.

As shown in Table 5, all methods produce good results, with sensitivity > 0.7, specificity > 0.9 and AUC > 0.85. It should be noted that Kandaswamy et al. used CD-HIT [21] to remove redundant proteins with more than 40% sequence similarity. A common rule is that two sequences are homologous if they are more than 30% identical over their entire lengths [52]. The presence of homologous proteins therefore makes this dataset relatively easy for every method. Our method also shows promising results, with sensitivity = 0.912 ± 0.014 and specificity = 0.962 ± 0.017. Among the predictors considered, Nath's method gives the highest AUC of 0.991. Our PredBLP-U achieves the highest MCC value (0.849 ± 0.019) and the second highest AUC value (0.968 ± 0.009), slightly below that of Nath's method.

Table 5 Comparison of the proposed PredBLP-U with previous methods on Kandaswamy’s training dataset

Comparison with other predictors on independent testing datasets

To test the generalization capability of our method, we further test PredBLP on four independent testing sets, covering general BLPs and the three lineages of BLPs. Here, we compare our method with BLProt [11] and SCMBLP [15] because the remaining predictors were either no longer maintained or unavailable. We also test the universal model and the lineage-specific models on the three lineages of BLPs. First, we randomly pick 80% of the BLPs and 80% of the non-BLPs from our independent dataset. Next, we use these proteins to evaluate BLProt and SCMBLP. We repeat this procedure 10 times to avoid potential bias in the under-sampling. Finally, we calculate the statistical differences in MCC values among the state-of-the-art predictors.

Table 6 summarizes the prediction results of the state-of-the-art predictors on the independent datasets. Since BLProt and SCMBLP were both constructed on general BLPs, we compare our universal model with these two predictors. In general, our predictor produces promising results, with a mean AUC > 0.75. Moreover, the three lineage-specific predictors all outperform the corresponding universal ones, which empirically supports the effectiveness of the lineage-specific scheme. These results demonstrate the good generalization capability of our method as well as the effectiveness of the lineage-specific strategy. To test whether the improvement is statistically significant, we first use the Shapiro-Wilk test [53] to check whether the data are normally distributed. If they are, we use Student's t-test [54]; otherwise, we use the Wilcoxon signed-rank test [55]. A p-value of less than 0.05 indicates that the difference is statistically significant. This experiment shows that PredBLP significantly outperforms the other predictors.

Table 6 Comparison of PredBLP with other methods on the independent testing dataset

Application to newly deposited BLPs in UniProt

Computational tools are often used to identify unknown proteins in practice. Considering this, we collect the BLPs that were deposited in UniProt from August 2016 to February 2017. Next, we build four types of datasets, comprising general BLPs together with bacteria, eukaryota, and archaea BLPs. We randomly pick 80% of the BLPs as the testing dataset and repeat this procedure 10 times, as stated in the section "Comparison with other predictors on independent testing datasets". Here, we compare our web server PredBLP with BLProt and SCMBLP. To achieve a fair comparison, we use the default parameters for these three predictors.

As listed in Table 7, for general BLPs, the proposed PredBLP-U correctly identifies about 90% of the BLPs, which is 10% more than SCMBLP and BLProt. The p-value indicates that the improvement is statistically significant. We see similar results for bacteria BLPs and archaea BLPs. Notably, the lineage-specific models all perform better than the universal model. Both SCMBLP and PredBLP recognize more than 95% of archaea BLPs. However, although the lineage-specific model gives higher results for archaea, its improvement over the universal model is not statistically significant. The limited number of archaea BLPs could account for this.

Table 7 Comparison of PredBLP with other methods on newly deposited BLPs

Conclusion

In this study, we propose a novel predictor for the identification of BLPs that uses sequence-derived features and a lineage-specific scheme. Experiments on benchmark datasets demonstrate the robustness and effectiveness of our method. We ascribe the good performance of the proposed method to three aspects. First, we collect features that reflect the intrinsic properties of BLPs as opposed to non-BLPs. These features are also able to distinguish the various lineages of BLPs. Second, the feature selection procedure is effective: it retains the majority of the informative features and removes noisy ones. Third, we introduce a lineage-specific strategy, which proves more powerful than the traditional universal approach. To our knowledge, this is the first time a lineage-specific strategy has been introduced in this field. It characterizes BLPs in a more specific way. The prediction performance on the independent testing datasets and on newly deposited BLPs in UniProt demonstrates that our method generalizes well and outperforms many state-of-the-art methods. Additionally, we show empirically that our predictor is competitive with currently public predictors.

Abbreviations

AAC: Amino acid composition

AUC: Area under receiver operating characteristic curve

DC: Dipeptide composition

DIG: Difference in information gain

IG: Information gain

MCC: Matthews correlation coefficient

PCP: Physicochemical properties

pKa: Protein kinase A

PredBLP: Prediction of bioluminescent proteins

PredBLP-A: Archaea-specific model of PredBLP

PredBLP-B: Bacteria-specific model of PredBLP

PredBLP-E: Eukaryota-specific model of PredBLP

PredBLP-U: Universal model of PredBLP

ROC: Receiver operating characteristic

References

  1. Widder EA. Bioluminescence in the ocean: origins of biological, chemical, and ecological diversity. Science. 2010;328(5979):704–8.

  2. Rowe L, Dikici E, Daunert S. Engineering bioluminescent proteins: expanding their analytical potential. Anal Chem. 2009;81(21):8662–8.

  3. Mirasoli M, Michelini E. Analytical bioluminescence and chemiluminescence. Anal Bioanal Chem. 2014;406(23):5529.

  4. Contag CH, Bachmann MH. Advances in in vivo bioluminescence imaging of gene expression. Annu rev Biomed eng. 2002;4(1):235–60.

  5. Burgos J, Rosol M, Moats R, Khankaldyyan V, Kohn D, Nelson M Jr, et al. Time course of bioluminescent signal in orthotopic and heterotopic brain tumors in nude mice. BioTechniques. 2003;34(6):1184–8.

  6. Navizet I, Liu YJ, Ferre N, Roca-Sanjuán D, Lindh R. The chemistry of bioluminescence: an analysis of chemical functionalities. ChemPhysChem. 2011;12(17):3064–76.

  7. Hosseinkhani S. Molecular enigma of multicolor bioluminescence of firefly luciferase. Cell Mol Life Sci. 2011;68(7):1167–82.

  8. Kheirabadi M, Sharafian Z, Naderi-Manesh H, Heineman U, Gohlke U, Hosseinkhani S. Crystal structure of native and a mutant of Lampyris turkestanicus luciferase implicate in bioluminescence color shift. Biochim Biophys Acta-Proteins Proteomics. 2013;1834(12):2729–35.

  9. Erez Y, Presiado I, Gepshtein R, da Silva Ls P, Esteves da Silva JC, Huppert D. Comparative study of the photoprotolytic reactions of D-luciferin and oxyluciferin. J Phys Chem a. 2012;116(28):7452–61.

  10. Sternberg C, Eberl L, Poulsen LK, Molin S. Detection of bioluminescence from individual bacterial cells: a comparison of two different low-light imaging systems. J Biolumin Chemilumin. 1997;12(1):7–13.

  11. Kandaswamy KK, Pugalenthi G, Hazrati MK, Kalies K-U, Martinetz T. BLProt: prediction of bioluminescent proteins based on support vector machine and relieff feature selection. BMC Bioinformatics. 2011;12(1):1.

  12. Zhao X, Li J, Huang Y, Ma Z, Yin M. Prediction of bioluminescent proteins using auto covariance transformation of evolutional profiles. Int J Mol Sci. 2012;13(3):3650–60.

  13. Fan G-L, Li Q-Z. Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou's pseudo amino acid composition. J Theor Biol. 2013;334:45–51.

  14. H-l H, Lee H-c, Charoenkwan P, W-l H, L-s S, Ho S-Y. Interpretable knowledge acquisition for predicting bioluminescent proteins using an evolutionary fuzzy classifier method. Training. 2014;300:300.

  15. Huang H-L. Propensity scores for prediction and characterization of bioluminescent proteins from sequences. PLoS One. 2014;9(5):e97158.

  16. Nath A, Subbiah K. Unsupervised learning assisted robust prediction of bioluminescent proteins. Comput Biol med. 2016;68:27–36.

  17. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(Database issue):D204–12.

  18. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids res. 1997;25(17):3389–402.

  19. Hu J. BLKnn: A K-nearest neighbors method for predicting bioluminescent proteins. In: Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE conference on: 2014. IEEE: 1-6.

  20. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer EL. Pfam: the protein families database. Nucleic acids research. 2014;42(D1):D222–30.

  21. Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282–3.

  22. Peng Z, Kurgan L. High-throughput prediction of RNA, DNA and protein binding regions mediated by intrinsic disorder. Nucleic Acids res. 2015;43(18):e121.

  23. Mizianty MJ, Kurgan L. Sequence-based prediction of protein crystallization, purification and production propensity. Bioinformatics. 2011;27(13):i24–33.

  24. Zhang J, Kurgan L. Review and comparative assessment of sequence-based predictors of protein-binding residues. Brief Bioinform. 2017, bbx022.

  25. Chen K, Jiang Y, Du L, Kurgan L. Prediction of integral membrane protein type by collocated hydrophobic amino acid pairs. J Comput Chem. 2009;30(1):163–72.

  26. Zhang J, Chai H, Gao B, Yang G, Ma Z. HEMEsPred: Structure-based Ligand-specific Heme Binding Residues Prediction by Using Fast-adaptive Ensemble Learning Scheme. IEEE/ACM Trans Comput Biol Bioinform. 2016, PP(99):1–1.

  27. Zhang J, Gao B, Chai H, Ma Z, Yang G. Identification of DNA-binding proteins using multi-features fusion and binary firefly optimization algorithm. BMC Bioinformatics. 2016;17(1):323.

  28. Kim SB, Otani Y, Umezawa Y, Tao H. Bioluminescent indicator for determining protein-protein interactions using intramolecular complementation of split click beetle luciferase. Anal Chem. 2007;79(13):4820–6.

  29. Guo S, Liu C, Zhou P, Li Y. A Multifeatures fusion and discrete firefly optimization method for prediction of protein tyrosine Sulfation residues. Biomed res Int. 2016;2016:8151509.

  30. Alipour BS, Hosseinkhani S, Ardestani SK, Moradi A. The effective role of positive charge saturation in bioluminescence color and thermostability of firefly luciferase. Photochem Photobiol Sci. 2009;8(6):847–55.

  31. Li C-H, Tu S-C. Active site hydrophobicity is critical to the bioluminescence activity of Vibrio Harveyi luciferase. Biochemistry. 2005;44(39):12970–7.

  32. Moradi A, Hosseinkhani S, Naderi-Manesh H, Sadeghizadeh M, Alipour BS. Effect of charge distribution in a flexible loop on the bioluminescence color of firefly luciferases. Biochemistry. 2009;48(3):575–82.

  33. Hirano T, Hasumi Y, Ohtsuka K, Maki S, Niwa H, Yamaji M, et al. Spectroscopic studies of the light-color modulation mechanism of firefly (beetle) bioluminescence. J am Chem Soc. 2009;131(6):2385–96.

  34. Kudryasheva NS. Bioluminescence and exogenous compounds: Physico-chemical basis for bioluminescent assay. J Photochem Photobiol B Biol. 2006;83(1):77–86.

  35. Roda A, Mirasoli M, Michelini E, Di Fusco M, Zangheri M, Cevenini L, et al. Progress in chemical luminescence-based biosensors: a critical review. Biosens Bioelectron. 2016;76:164–79.

  36. Pinto da Silva L, Esteves da Silva JC. Computational investigation of the effect of pH on the color of firefly bioluminescence by DFT. ChemPhysChem. 2011;12(5):951–60.

  37. Sellenet PH, Allison B, Applegate BM, Youngblood JP. Synergistic activity of hydrophilic modification in antibiotic polymers. Biomacromolecules. 2007;8(1):19–23.

  38. Iden S, Collard JG. Crosstalk between small GTPases and polarity proteins in cell polarization. Nat rev Mol Cell Biol. 2008;9(11):846–59.

  39. Kim B, Young T, Harder E, Friesner RA, Berne BJ. Structure and dynamics of the solvation of bovine pancreatic trypsin inhibitor in explicit water: a comparative study of the effects of solvent and protein polarizability. J Phys Chem B. 2005;109(34):16529–38.

  40. Liu Y, Bolen D. The peptide backbone plays a dominant role in protein stabilization by naturally occurring osmolytes. Biochemistry. 1995;34(39):12884–91.

  41. Samanta U, Bahadur RP, Chakrabarti P. Quantifying the accessible surface area of protein residues in their local environment. Protein eng. 2002;15(8):659–67.

  42. FAUCHÈRE JL, Charton M, Kier LB, Verloop A, Pliska V. Amino acid side chain parameters for correlation studies in biology and pharmacology. Chem Biol Drug des. 1988;32(4):269–78.

  43. Vihinen M, Torkkila E, Riikonen P. Accuracy of protein flexibility predictions. Proteins. 1994;19(2):141–9.

  44. Bhaskaran R, Ponnuswamy P. Positional flexibilities of amino acid residues in globular proteins. Chem Biol Drug des. 1988;32(4):241–55.

  45. Cheng Q, Zhou H, Cheng J. The fisher-markov selector: fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data. IEEE Trans Pattern Anal Mach Intell. 2011;33(6):1217–33.

  46. Pudil P, Novovičová J, Kittler J. Floating search methods in feature selection. Pattern Recogn Lett. 1994;15(11):1119–25.

  47. Gunn SR. Support vector machines for classification and regression. ISIS Tech Rep. 1998;14

  48. Burges CJ. A tutorial on support vector machines for pattern recognition. Data min Knowl Disc. 1998;2(2):121–67.

  49. Scott D, Dikici E, Ensor M, Daunert S. Bioluminescence and its impact on bioanalysis. Annu rev Anal Chem. 2011;4:297–319.

  50. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Tech (TIST). 2011;2(3):27.

  51. O'Brien MA, Moravec RA, Riss TL, Bulleit RF. Homogeneous, bioluminescent proteasome assays. Methods Mol Biol. 2015;1219:95–114.

  52. Pearson WR. An introduction to sequence similarity (“homology”) searching. Curr Protoc Bioinformatics. 2013, Chapter 3:Unit3.1.

  53. Razali NM, Wah YB. Power comparisons of shapiro-wilk, kolmogorov-smirnov, lilliefors and anderson-darling tests. J Statistical Model Anal. 2011;2(1):21–33.

  54. Haynes W. Student’s t-test. In: Encyclopedia of Systems Biology. New York: Springer; 2013:2023-2025.

  55. Rey D, Neuhäuser M. Wilcoxon-signed-rank test. In: International encyclopedia of statistical science. Berlin Heidelberg: Springer; 2011: 1658-1659.

Acknowledgments

We thank the Fundamental Research Funds for the Central Universities (Northeast Normal University), the National Natural Science Foundation of China and the Natural Science Foundation of Jilin Province for providing the funding for this work.

Funding

This work is supported by the Fundamental Research Funds for the Central Universities (Grant No. 14ZZ2240), the National Natural Science Foundation of China (Grant No. 61603087), and the Natural Science Foundation of Jilin Province (Grant No. 20160101253JC).

Availability of data and materials

All the supplementary tables and figures used in this research can be found in Additional file 1, which is associated with this paper. The datasets used in this research are available at http://www.inforstation.com/PredBLP/.

Authors’ contributions

JZ conceived the idea and designed the experiments. HTC compiled the datasets, and optimized the method. GFY implemented the web server. ZQM supervised the progress of the whole project. JZ and HTC drafted the first version of the manuscript. All authors have read and approved the final manuscript.

Authors’ information

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Corresponding author

Correspondence to Zhiqiang Ma.

Additional file

Additional file 1:

Table A1. Physicochemical properties for twenty amino acids. Table A2. The relative amino acid composition of BLPs. Table A3. The relative dipeptide composition of general BLPs. Table A4. The relative dipeptide composition of bacteria BLPs. Table A5. The relative dipeptide composition of eukaryota BLPs. Table A6. The relative dipeptide composition of archaea BLPs. Table A7. The performance of different features and their combinations on three training sets using five-fold cross-validation. Table A8. The lists of optimum feature subsets in four training sets. Figure A1. An overview of the importance of the features in four training sets. Figure A2. Venn diagrams of the overlap between the discriminatory and selected useful features in the optimal subset for each type of features. (DOCX 1741 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Zhang, J., Chai, H., Yang, G. et al. Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme. BMC Bioinformatics 18, 294 (2017). https://doi.org/10.1186/s12859-017-1709-6

  • DOI: https://doi.org/10.1186/s12859-017-1709-6

Keywords