A machine learning approach for the identification of odorant binding proteins from sequence-derived properties
© Pugalenthi et al. 2007
Received: 08 May 2007
Accepted: 19 September 2007
Published: 19 September 2007
Odorant binding proteins (OBPs) are believed to shuttle odorants from the environment to the underlying odorant receptors, for which they could potentially serve as odorant presenters. Although several sequence-based search methods have been exploited for protein family prediction, less effort has been devoted to predicting OBPs from sequence data. The task is also more challenging because sequence identity between these proteins is poor.
In this paper, we propose a new algorithm that uses a Regularized Least Squares Classifier (RLSC) in conjunction with multiple physicochemical properties of amino acids to predict odorant-binding proteins. The algorithm was applied to a dataset derived from the Pfam and GenDiS databases, and we obtained an overall prediction accuracy of 97.7% (94.5% and 98.4% for the positive and negative classes, respectively).
Our study suggests that RLSC is potentially useful for predicting odorant binding proteins from sequence-derived properties irrespective of sequence similarity. Our method correctly predicts 92.8% of 56 odorant binding proteins that are non-homologous to any protein in the SWISSPROT database and 97.1% of the 414 proteins in the independent dataset, suggesting that the RLSC method is useful for predicting odorant binding proteins from sequence information.
Olfaction is an important process for establishing behavioural responses and involves the binding of small, hydrophobic, volatile molecules to receptors of the nasal neuroepithelia. The olfaction mechanism has been well studied and is generally similar in vertebrates, insects, crustaceans, and nematodes [2–4]. The first step in olfaction is the solubilization of the hydrophobic odorants in the hydrophilic nasal mucus.
Odorant binding proteins (OBPs) play a vital role in olfaction. OBPs are small soluble polypeptides that are thought to act as carriers, transporting odorants from the environment to the nasal epithelium in vertebrates and to the sensillar lymph in insects [5, 6]. Vertebrate OBPs are members of the large lipocalin family and share an eight-stranded beta barrel fold. Insect OBPs include the general odorant-binding proteins (GOBPs) and the pheromone-binding proteins (PBPs), which are completely different from their vertebrate counterparts in both sequence and three-dimensional fold. Insect OBPs contain an alpha-helical barrel and six highly conserved cysteines. Another class of putative OBPs, named chemosensory proteins (CSPs), has been reported in different orders of insects, including Lepidoptera [10–12]. These polypeptides, of about 12 kDa, do not exhibit significant homology to PBPs and GOBPs and contain four conserved cysteine residues, all involved in intramolecular disulphide bridges. In spite of the differences in sequence and structure, their general chemical properties indicate similar functions in olfactory transduction.
Previous reports have shown that OBPs are present in large numbers within a species. This suggests that OBPs play an active role in odorant recognition rather than merely serving as passive odorant shuttles [14, 15]. Several reports have demonstrated selective binding of odorants to different OBPs derived from a given species [16–18]. OBPs are also suspected to participate in the deactivation of odorants and in signal termination. The presence of OBPs in non-sensory tissues of insects suggests that they also have non-sensory roles.
Although many efforts have been made to study the role of OBPs, their physiological function is still unclear, and more sequence data are required for a complete understanding of the odorant binding and transport mechanism. With the rapid increase in newly found protein sequences entering databanks, an efficient method is needed to identify OBPs in the sequence databases. At present, prediction of odorant binding proteins is primarily based on sequence similarity search methods [21, 22]. These methods cannot be applied effectively because OBPs show very low sequence similarity, both between species and within the same species [23, 24]. So far, SVM and other statistical learning methods have not been explored for predicting odorant binding proteins. Here, we propose a method based on the regularized least squares classifier (RLSC) to predict odorant binding proteins from sequence-derived properties irrespective of sequence similarity.
Confusion matrix for RLSC on the training dataset
Classification results achieved on different feature subsets. The optimal values of σ and λ are also given.
Prediction results for the 414 odorant binding proteins by the RLSC, PSI-BLAST and HMM methods (rows: correctly predicted as odorant binding proteins; incorrectly predicted as non-odorant binding proteins)
Further analysis of the 414 odorant binding proteins shows that 56 proteins have no homologous protein in the SWISSPROT database, based on PSI-BLAST search results. A similarity E-value threshold of 0.01 was used for the homologue search to ensure maximum exclusion of proteins that have a homologue. Our method correctly predicts 52 of these 56 proteins as odorant binding proteins. This result shows that our prediction system can recognize novel odorant binding proteins that are non-homologous to other proteins.
In this work, a total of nine physicochemical properties, secondary structural content, and the frequencies of dipeptides and tripeptides were used to represent each protein sequence. It has been reported that not all features contribute equally to the classification of proteins; some have been found to play a more prominent role than others in specific aspects of proteins. It is therefore of interest to examine which features play more prominent roles in the classification of odorant-binding proteins. Our analysis suggests that molecular weight, hydrophobicity, hydration potential, average accessible surface area and refractivity play the more prominent roles. Hydrophobicity is an important factor in the formation of the binding pocket and in the interaction between the OBP and the odorant molecule. We also observe that tripeptides play a more significant role in our classification scheme than dipeptides.
The overall prediction accuracy of 97.7% (94.5% and 98.4% for the positive and negative classes, respectively) shows that RLSC is a potentially useful tool for the prediction of odorant-binding proteins. It is also a computationally efficient method for predicting odorant binding proteins despite their low sequence identity. Further, the capability of our method was tested on an independent dataset of 414 odorant binding proteins, of which it correctly predicts 97.1%. This approach can be used to identify novel odorant binding proteins in genome sequence databases using sequence-derived properties.
where $k$ is the so-called kernel function that models the relationship between the data points $x_i$ and $x$, and the coefficients $\alpha_i$ are to be computed by training. In practice, the kernel function is usually defined before training the RLSC. The $\alpha_i$ are then computed through the training process, which involves solving a system of linear equations:
$$(K + \lambda n I)\,\alpha = Y \qquad (2)$$
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n]^T$, $Y = [y_1, y_2, \ldots, y_n]^T$ and $\lambda$ is a predefined positive constant called the regularization parameter. $I$ is the identity matrix of size $n$. $K$ is the kernel matrix, whose components are computed as $K_{ij} = k(x_i, x_j)$.
In our experiment, a Gaussian kernel $k(x_i, x_j) = \exp(-\sigma^2 \lVert x_i - x_j \rVert^2)$ is used for the RLSC, since the Gaussian kernel is suggested as the first choice for most kernel methods. The values of the kernel parameter $\sigma$ and the regularization parameter $\lambda$ are crucial to the RLSC's performance, so both parameters are optimized to maximize the balanced leave-one-out accuracy. Because of the specific formulation of RLSC, the leave-one-out cross-validation (LOOCV) used to tune these parameters does not require retraining for each left-out protein: the training computation is carried out only once, which avoids the long running time that LOOCV would otherwise incur.
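The training step and the single-pass leave-one-out evaluation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the decision function is taken to be the standard RLSC form f(x) = Σ α_i k(x_i, x), and the leave-one-out shortcut is the well-known regularized least squares identity f_{-i}(x_i) = y_i − α_i / (G⁻¹)_ii with G = K + λnI, which holds when the term λn is kept fixed while a sample is left out. Whether the paper used exactly this identity is an assumption.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-sigma^2 * ||a - b||^2)."""
    sq_dist = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-(sigma**2) * np.maximum(sq_dist, 0.0))

def rlsc_fit(X, y, sigma, lam):
    """Solve (K + lam*n*I) alpha = y; also return leave-one-out outputs.

    The leave-one-out outputs come from one matrix inversion
    (assumed shortcut, with the lam*n term held fixed).
    """
    n = X.shape[0]
    G = gaussian_kernel(X, X, sigma) + lam * n * np.eye(n)
    G_inv = np.linalg.inv(G)
    alpha = G_inv @ y
    loo_outputs = y - alpha / np.diag(G_inv)
    return alpha, loo_outputs

def rlsc_predict(X_train, alpha, X_new, sigma):
    """Class labels in {-1, +1} from the sign of sum_i alpha_i k(x_i, x)."""
    return np.sign(gaussian_kernel(X_new, X_train, sigma) @ alpha)
```

With labels coded as +1 (odorant binding) and −1 (other), the sign of each leave-one-out output gives the leave-one-out prediction for that protein, so a grid search over σ and λ needs only one inversion per parameter pair.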
All odorant binding proteins were obtained from the GenDiS and Pfam databases. Sequences with more than 40% sequence identity were removed from the dataset. After careful manual examination, a total of 476 odorant binding proteins were used to construct the positive dataset, which includes 40 vertebrate odorant binding proteins, 282 insect general odorant binding proteins, 46 pheromone binding proteins and 108 chemosensory proteins [see Additional file 1]. Because the number of known odorant binding proteins is limited, the positive dataset could not be enlarged any further. In future, as more sequences are identified as belonging to the family, the positive dataset can be enriched. The negative samples were taken from the seed proteins of Pfam protein families that are unrelated to odorant binding proteins. Our final negative dataset consists of 2157 non-odorant binding domains [see Additional file 2].
Amino acid composition is one of the most basic characteristics of proteins and is extensively used in sequence-based prediction studies. Instead of the conventional 20-D amino acid composition, a newer concept called "pseudo amino acid composition" has been reported; it includes sequence-order information, which leads to a higher success rate in sequence-based prediction studies [37–40]. Owing to the wide application of PseAA (pseudo amino acid) composition, a web server called PseAA was recently designed in a flexible manner to generate various kinds of PseAA composition for a given protein sequence [37, 38] according to the needs of users. Apart from amino acid composition, sequence-derived structural and physicochemical features have frequently been used in various prediction studies.
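As a quick consistency check on the reported figures, the overall accuracy can be recomputed as the size-weighted average of the two per-class accuracies. The sketch below assumes that the reported 94.5% and 98.4% were measured on the full sets of 476 positive and 2157 negative samples; the function name is hypothetical.

```python
def overall_accuracy(acc_pos, n_pos, acc_neg, n_neg):
    """Size-weighted average of the per-class accuracies."""
    return (acc_pos * n_pos + acc_neg * n_neg) / (n_pos + n_neg)

# Assumed inputs: class sizes from this section, accuracies as reported.
overall = overall_accuracy(0.945, 476, 0.984, 2157)
print(f"{100 * overall:.1f}%")  # prints 97.7%
```

Under that assumption, the result agrees with the reported overall accuracy of 97.7%.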
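As an illustration of the composition-type features discussed here, the following sketch computes the frequencies of all k-residue peptides in a sequence: k = 1 gives the conventional 20-D amino acid composition, k = 2 the 400 dipeptide frequencies, and k = 3 the 8000 tripeptide frequencies. The function name and the handling of non-standard residues (windows containing them are ignored) are assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def kmer_frequencies(sequence, k):
    """Frequencies of all 20**k k-residue peptides, in lexicographic order.

    Windows containing non-standard letters are ignored (assumption).
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

# Example: concatenate composition and dipeptide features for one sequence.
features = kmer_frequencies("MKLLVAC", 1) + kmer_frequencies("MKLLVAC", 2)
```

A full feature vector in the spirit of the paper would append the physicochemical and secondary-structure descriptors to these frequencies.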
Amino acid groupings (11 groups) according to their physical and chemical properties
F, I, W, L, V, M, Y, C, A
R, K, N, D, E, P
R, H, K, D, E
T, H, G, S, Q
I, L, V
F, W, Y, H
N, Q, R, E, D
F, M, I, L, V
C, K, H, Y, W
P, V, A, G, T, S, N, D
Three tests are often used to examine the accuracy of a statistical prediction method: the independent dataset test, the sub-sampling test (e.g., 5-fold or 10-fold) and the jackknife test. Of these, the jackknife test is deemed the most rigorous and objective, as analyzed in a comprehensive review, and it has been increasingly adopted by leading investigators to test the power of various prediction methods [47–51].
In this paper, we used the leave-one-out (i.e., jackknife) cross-validation approach to estimate the generalization performance of the classifier. It involves removing one protein from the training set, training the classifier (in our case, the RLSC) on the remaining proteins, and then predicting the class label of the removed (left-out) protein with the trained classifier. This process is repeated until every protein has been left out once. The leave-one-out accuracy is then computed by dividing the total number of correct predictions by n, the number of samples in the original dataset.
where TP and TN denote the true positive and true negative rates, respectively.
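The leave-one-out procedure and the two accuracy measures can be sketched as follows. This is a generic illustration, not the authors' code: the function names are hypothetical, `fit` and `predict` stand for any classifier, labels are assumed to be coded as +1 and −1, and the balanced accuracy is assumed to be the mean of the true positive and true negative rates.

```python
import numpy as np

def leave_one_out_predictions(X, y, fit, predict):
    """Hold out each sample in turn, train on the rest, predict the held-out one."""
    n = len(y)
    predictions = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i          # mask that drops sample i
        model = fit(X[keep], y[keep])
        predictions[i] = predict(model, X[i:i + 1])[0]
    return predictions

def accuracy(y_true, y_pred):
    """Number of correct predictions divided by n."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive and true negative rates (assumed definition)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    positive = y_true == 1
    tp_rate = np.mean(y_pred[positive] == 1)
    tn_rate = np.mean(y_pred[~positive] == -1)
    return float(0.5 * (tp_rate + tn_rate))
```

With a negative class about 4.5 times larger than the positive class (2157 versus 476), the balanced measure keeps the majority class from dominating the score, which is a plausible reason for tuning the parameters on it.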
GP, KT and PNS acknowledge the financial support offered by A*STAR (Agency for Science, Technology and Research, Singapore) under grant # 052 101 0020. RS acknowledges the National Centre for Biological Sciences (TIFR) for infrastructural and financial support. RS also acknowledges the Wellcome Trust (UK) for funding. The authors thank Professor Dmitrij Frishman for his comments on this work.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.