- Methodology article
- Open Access
Predicting receptor-ligand pairs through kernel learning
- Ernesto Iacucci1,
- Fabian Ojeda1,
- Bart De Moor1 and
- Yves Moreau1 (corresponding author)
https://doi.org/10.1186/1471-2105-12-336
© Iacucci et al; licensee BioMed Central Ltd. 2011
- Received: 21 February 2011
- Accepted: 11 August 2011
- Published: 11 August 2011
Abstract
Background
Regulation of cellular events is often initiated via extracellular signaling. Extracellular signaling occurs when a circulating ligand interacts with one or more membrane-bound receptors. Identification of receptor-ligand pairs is thus an important and specific form of protein-protein interaction (PPI) prediction.
Results
Given a set of disparate data sources (expression data, domain content, and phylogenetic profile), we seek to predict new receptor-ligand pairs. We create a combined kernel classifier and assess its performance against the Database of Ligand-Receptor Partners (DLRP) 'gold standard' and against the method proposed by Gertz et al. For the tgfβ family, our predictions reconstruct over 76% of the supported edges (0.76 recall and 0.67 precision) of the receptor-ligand bipartite graph defined by the DLRP gold standard. For the same family, the combined kernel classifier improves on the Gertz et al. method by a factor of approximately 1.5 in F-measure (0.71 versus 0.48).
Conclusions
The prediction of receptor-ligand pairings is a difficult and complex task. We have demonstrated that using kernel learning on multiple data sources provides a stronger alternative to the existing method in solving this task.
Keywords
- Domain Content
- Multiple Data Source
- Phylogenetic Profile
- Kernel Learning
- Chemokine Family
Background
Regulation of cellular events is often initiated via extracellular signaling. Extracellular signaling occurs when a circulating ligand interacts with one or more membrane-bound receptors. Identification of receptor-ligand pairs is thus an important and specific form of protein-protein interaction (PPI) prediction. While the problem of predicting PPI has been studied extensively, little effort has been devoted to the sub-problem of predicting receptor-ligand interactions.
A tremendous amount of research has been applied to the problem of predicting PPI. Foremost in the field has been prediction via phylogenetic profile analysis. Generally, this type of investigation studies the similarity between the phylogenetic history of a protein A and that of its putative protein partner B. Broadly speaking, these studies differ in the measure they use to quantify the similarity between the phylogenetic histories of A and B, and each seeks the most accurate such measure. Their findings support the idea that interacting proteins have similar phylogenetic profiles, as these proteins should co-adapt because they are under the same evolutionary pressures [1, 2].
Bhardwaj et al. [3] use the phylogenetic information strategy and add expression data to predict PPI. Their findings support the idea that integrating gene expression profiles with phylogenetic information yields more accurate predictions than phylogenetic analysis alone. The rationale for using co-expression as an indicator of PPI originates from the observation that proteins which interact to perform a common function are likely to be co-expressed, as they need to be present at the same time to carry out their common biological activity [4, 5].
Combining expression and phylogenetic information to predict PPI is a step towards a wider variety of data integration. Here we propose a framework in which other sources of data (such as domain content) can be incorporated into a kernel-based solution to the problem. The rationale behind incorporating domain content information is as follows: as certain domains are known to interact, we expect this data to provide additional insight into the problem of determining receptor-ligand pairs.
One of the groups that has tackled the receptor-ligand prediction task is Gertz et al. [6]. In their work, they match members of the chemokine and tgfβ ligand families with their respective receptor families. They used distance matrices of the receptor and ligand families to measure similarity between the groups. Through a Metropolis Monte Carlo optimization algorithm, they explored and scored possible matches between the two matrices until they reached optimal solutions. While their work was successful, it relies only on phylogenetic distance matrices. Here we propose the integration of multiple data sources to help make more accurate matches.
We investigate the use of a combined kernel classifier to carry out this learning task. While many kernel-based machine learning techniques have been applied to the PPI task [7, 8], to our knowledge they have not previously been used on the receptor-ligand problem. Kernel learning provides the means to use heterogeneous, indirectly related data (such as expression measures, domain content, etc.) and to perform classification in a higher-dimensional space via kernel methods. In our work, we apply the least-squares support vector machine (LS-SVM) method, based on the conclusions of Suykens et al. [9], which show this implementation to be robust. Because different data sources are used, separate LS-SVM kernel classifiers were built and their outputs combined to provide a final result.
While the task addressed here is the prediction of successful protein ligand-receptor pairings, a related area of research is the protein-chemical interaction prediction task, to which kernels have sometimes been applied. For example, Nagamine et al. [10] approach this task through the use of an SVM trained on vector representations of protein-chemical pairs. Building on this, Jacob et al. [11] demonstrate the utility of hierarchical kernels for matching proteins with chemical ligands in a similar learning task. This line of research was further advanced by Bleakley et al. [12], who introduce bipartite local models which use kernels to successfully predict several reported drug-target interactions.
We first describe our combined kernel classifier method to predict receptor-ligand pairings. We then present the bipartite graph we derive from our findings and compare it to our gold standard and to results previously published by Gertz et al. [6]. Following this, we interpret the performance of our method with respect to this comparison. To conclude, we discuss the benefits and limitations of our method and possible future directions for this work.
Methods
1. Problem Formulation
Our objective is to predict candidate receptor-ligand pairs; more specifically, we seek to create a method to identify known pairs as well as to determine putative pairs for further research. Our method involves using multiple data sources (expression, phylogenetic, and protein-domain content information), computing separate kernels for each data type, creating LS-SVM classifiers and combining the results to predict receptor-ligand pairs.
2. Data Sources
Candidate receptor and ligand sequences were retrieved for seven species (Rattus norvegicus, Mus musculus, Homo sapiens, Pan troglodytes, Canis familiaris, Cavia porcellus, and Bos taurus) from Ensembl build 51 [13]. The sequences were then aligned using ClustalW [14]. Once aligned, the sequences were edited to eliminate the positions which had the 5% lowest substitution scores across the seven orthologous sequences. The pair-wise alignment score was then taken for each possible species-to-species comparison between the edited orthologous sequences (with seven species, this gives 7 × 6 / 2 = 21 pair-wise comparisons per candidate). These scores form a phylogenetic vector [2], which is then used to create the phylogenetic kernel.
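As a rough illustration of how the 21-element phylogenetic vector arises, the sketch below enumerates all unordered species pairs and scores each pair of edited orthologous sequences. The `identity_score` function is a hypothetical stand-in for the pair-wise alignment score, whose exact definition is not given here, and the sequences are toy data.

```python
from itertools import combinations

SPECIES = [
    "Rattus norvegicus", "Mus musculus", "Homo sapiens", "Pan troglodytes",
    "Canis familiaris", "Cavia porcellus", "Bos taurus",
]

def identity_score(seq_a, seq_b):
    """Hypothetical stand-in for a pair-wise alignment score:
    fraction of identical positions in two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)

def phylogenetic_vector(orthologs, score=identity_score):
    """Return one score per unordered species pair, in a fixed order."""
    return [score(orthologs[s1], orthologs[s2])
            for s1, s2 in combinations(SPECIES, 2)]

# Toy aligned orthologs (invented): identical except for one cow residue.
toy = {species: "MKTAYIAK" for species in SPECIES}
toy["Bos taurus"] = "MKTAYLAK"

vec = phylogenetic_vector(toy)
print(len(vec))
```

Because the species order is fixed, the vectors of two candidates are directly comparable position by position.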
The expression data for the candidates were taken from the GNF human expression atlas (79 tissues) [15]. The data were normalized (values were mean-centered and scaled to unit standard deviation) and then transformed into the expression kernel.
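The normalization and kernel step can be sketched as follows. Two things here are assumptions rather than details from the text: standardization is applied per gene across tissues (using the population standard deviation), and the kernel is a plain linear kernel (dot product). The gene names and values are invented.

```python
from statistics import mean, pstdev

def standardize(profile):
    """Shift a profile to zero mean and scale it to unit standard deviation."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        raise ValueError("constant profile cannot be standardized")
    return [(x - mu) / sigma for x in profile]

def linear_kernel(profiles):
    """Gram matrix of dot products between all pairs of profiles (assumed kernel)."""
    n = len(profiles)
    return [[sum(a * b for a, b in zip(profiles[i], profiles[j]))
             for j in range(n)]
            for i in range(n)]

# Toy expression values over four tissues (the real atlas has 79).
raw = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # same shape as geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # opposite shape to geneA
}
names = sorted(raw)
z = [standardize(raw[g]) for g in names]
K_expr = linear_kernel(z)
```

With standardized profiles, each kernel entry is the Pearson correlation multiplied by the number of tissues, so co-expressed candidates receive large positive values.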
The domain content of each candidate protein (receptor or ligand) was taken from the InterPro database [16]. A vector was created for each candidate protein in which the presence of a protein domain was indicated with a '1' and its absence with a '0'. These vectors were then transformed into the domain content kernel.
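A minimal sketch of the binary domain encoding is given below. The protein and domain identifiers are placeholders, not real InterPro entries, and the dot-product kernel (which counts shared domains) is an assumption, since the kernel function is not specified here.

```python
def domain_vector(domains, vocabulary):
    """Encode a protein as 1/0 flags over a fixed, ordered domain vocabulary."""
    return [1 if d in domains else 0 for d in vocabulary]

def shared_domain_kernel(vectors):
    """Dot-product kernel on binary vectors: the number of shared domains."""
    n = len(vectors)
    return [[sum(a * b for a, b in zip(vectors[i], vectors[j]))
             for j in range(n)]
            for i in range(n)]

# Placeholder annotations (hypothetical identifiers).
annotations = {
    "protein_1": {"DOM_A", "DOM_B"},
    "protein_2": {"DOM_B", "DOM_C"},
    "protein_3": {"DOM_D"},
}
vocabulary = sorted(set().union(*annotations.values()))
proteins = sorted(annotations)
X = [domain_vector(annotations[p], vocabulary) for p in proteins]
K_dom = shared_domain_kernel(X)
```

The diagonal of `K_dom` holds each protein's domain count, and an off-diagonal zero means that two proteins share no annotated domain.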
3. Kernels and LS-SVM Classifier
The above-mentioned data matrices (phylogenetic, expression, and domain content) were used to create three kernels for each receptor-ligand family. LS-SVMs [9] were trained using the three kernels to predict outcomes for receptor-ligand pairs known from the Database of Ligand-Receptor Partners (DLRP) gold standard.
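For readers unfamiliar with LS-SVMs, the sketch below shows the core idea: training reduces to solving one linear system in the bias b and the support values α, rather than a quadratic program. It uses the function-estimation form applied to ±1 labels, which is algebraically equivalent to the classifier form. The regularization value `gamma`, the dense Gaussian-elimination solver, and the one-dimensional toy data are illustrative choices, not the setup used in this work.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def lssvm_train(K, y, gamma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [K[i][j] + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + [float(v) for v in y])
    return sol[0], sol[1:]

def lssvm_decision(k_row, bias, alpha):
    """Decision value b + sum_i alpha_i * K(x, x_i) for one kernel row."""
    return bias + sum(a * k for a, k in zip(alpha, k_row))

# Toy problem: four points on a line with a linear kernel.
x = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
K = [[a * b for b in x] for a in x]
bias, alpha = lssvm_train(K, y, gamma=10.0)
decisions = [lssvm_decision(K[i], bias, alpha) for i in range(len(x))]
```

The sign of each decision value gives the predicted class, and its magnitude can serve as a continuous score.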
Work flow of the combined kernel classifier. For each candidate, data were partitioned into training and validation sets and parameters were tuned using a 5-fold cross-validation strategy. The final output of each classifier was obtained with a leave-one-out strategy. The classifier outputs were combined into a final score and a threshold was applied to determine which pairs are predicted edges in the receptor-ligand bipartite graph.
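The last two steps of the work flow, combining the sub-classifier outputs and thresholding, can be sketched as below. The combination rule (an unweighted mean) and the threshold of zero are assumptions made for illustration, as neither is specified in the text. The scores are invented and stand in for leave-one-out outputs.

```python
from statistics import mean

def combine_and_threshold(per_source_scores, threshold=0.0):
    """Average sub-classifier scores per pair and keep pairs above a threshold."""
    sources = list(per_source_scores)
    pairs = set(per_source_scores[sources[0]])
    for s in sources[1:]:
        if set(per_source_scores[s]) != pairs:
            raise ValueError("all sub-classifiers must score the same pairs")
    combined = {p: mean(per_source_scores[s][p] for s in sources) for p in pairs}
    edges = {p for p, value in combined.items() if value > threshold}
    return combined, edges

# Invented leave-one-out scores for three (receptor, ligand) pairs.
scores = {
    "phylogenetic": {("R1", "L1"): 0.8, ("R1", "L2"): -0.4, ("R2", "L1"): 0.1},
    "expression":   {("R1", "L1"): 0.5, ("R1", "L2"): -0.9, ("R2", "L1"): -0.6},
    "domain":       {("R1", "L1"): 0.2, ("R1", "L2"): 0.1,  ("R2", "L1"): 0.2},
}
combined, edges = combine_and_threshold(scores)
print(sorted(edges))
```

Keeping the per-source scores separate until this point is what allows a sub-classifier to be swapped out without retraining the others.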
4. Construction of the Receptor-Ligand Bipartite Graph
We take as our gold standard the receptor-ligand dataset from the Database of Ligand-Receptor Partners (DLRP) [17]. In this dataset, cytokines and interleukins (as well as other ligands) are paired with their corresponding receptor partners. These interactions are then represented in an adjacency matrix where an interaction is represented as a '1' and lack of interaction is represented as a '0'. These are the values we are ultimately trying to predict.
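The adjacency-matrix encoding can be illustrated with a small example. The four pairs below are taken from the "supported" columns of the chemokine table in the Results, and the helper name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(receptors, ligands, known_pairs):
    """Rows are receptors, columns are ligands; 1 marks a known interaction."""
    known = set(known_pairs)
    return [[1 if (r, l) in known else 0 for l in ligands] for r in receptors]

receptors = ["CCR5", "IL8RB"]
ligands = ["SCYA3", "SCYA4", "SCYA5", "IL8"]
known_pairs = [
    ("CCR5", "SCYA3"), ("CCR5", "SCYA4"), ("CCR5", "SCYA5"),
    ("IL8RB", "IL8"),
]

A = adjacency_matrix(receptors, ligands, known_pairs)
for receptor, row in zip(receptors, A):
    print(receptor, row)
```

Each '1' is an edge of the bipartite graph, and each '0' is a receptor-ligand pair with no recorded interaction.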
Results and Discussion
Chemokine receptor-ligand predictions
Receptor | Gertz et al. (2003), supported | Gertz et al. (2003), unsupported | Iacucci et al., supported | Iacucci et al., unsupported |
---|---|---|---|---|
CCR3 | SCYA24 | -- | SCYA26 SCYA7 SCYA5 SCYA15 SCYA11 SCYA13 | SCYA3 SCYA17 SCYB6 SCYA4 SCYA21 SCYB5 IL8 SCYA27 |
CCR1 | SCYA2 SCYA8 SCYA13 SCYA7 | SCYA11 SCYA1 | SCYA3 SCYA7 SCYA8 SCYA5 SCYA15 SCYA13 | SCYA26 SCYA17 SCYB6 SCYA4 SCYA21 SCYB5 IL8 SCYA11 SCYA27 |
CCR5 | SCYA3 SCYA4 SCYA5 | SCYA14 SCYA15 SCYA23 | SCYA3 SCYA4 SCYA5 | SCYA26 SCYA17 SCYB6 SCYA21 SCYA15 SCYB5 IL8 SCYA11 SCYA27 |
IL8RA | SCYB6 SCYB5 | GRO1 GRO2 GRO3 PPBP | SCYB6 IL8 | SCYA17 SCYA4 SCYA21 SCYA15 SCYA13 |
CCR4 | -- | SCYA26 | SCYA17 | SCYA3 SCYA26 SCYA7 SCYA8 SCYB6 SCYA21 SCYB5 SCYA11 SCYA13 SCYA27 |
CCR2 | -- | SCYA21 SCYA19 | -- | SCYA3 SCYB6 SCYA4 SCYA21 SCYA5 SCYA15 SCYB5 SCYA11 |
CCR8 | SCYA17 | SCYA22 | -- | SCYA3 SCYA26 SCYA7 SCYA8 SCYB6 SCYB5 IL8 SCYA11 SCYA27 |
CXCR3, GPR9 | -- | SDF1 | SCYA21 SCYA11 | SCYA3 SCYA26 SCYA17 SCYA7 SCYA8 SCYB6 SCYA4 SCYA15 SCYB5 IL8 SCYA13 SCYA27 |
IL8RB | IL8 | -- | SCYB6 SCYB5 IL8 | SCYA3 SCYA17 SCYA7 SCYA8 SCYA4 SCYA21 SCYA5 SCYA15 SCYA11 SCYA13 SCYA27 |
BLR1, CXCR5 | -- | MIG SCYB10 SCYB11 | -- | -- |
CCBP2, CCR9 | SCYA25 | -- | -- | SCYB6 SCYA21 |
CCR6 | -- | SCYB13 | -- | SCYA26 |
CXCR4 | -- | SCYA27 | -- | -- |
CCR7 | -- | SCYA20 | -- | SCYA3 SCYA26 SCYA8 SCYA5 SCYA15 SCYB5 |
Our predictions for the tgfβ family reconstruct over 76% of the supported edges (0.76 recall and 0.67 precision) of the receptor-ligand bipartite graph defined by the DLRP. By comparison, the Gertz et al. [6] method reconstructs 44% of the supported edges (0.44 recall and 0.53 precision), so the combined kernel classifier improves recall by a factor of approximately 1.7. For this family of receptors and ligands, our approach therefore has an advantage: it reconstructs more known edges and introduces less noise. Comparing F-measures, our method improves substantially on that of Gertz et al. [6] (0.71 versus 0.48).
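The F-measures quoted above can be checked from the reported precision and recall. The sketch assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which is consistent with the reported tgfβ values.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ours = f_measure(precision=0.67, recall=0.76)
gertz = f_measure(precision=0.53, recall=0.44)
ratio = ours / gertz

print(round(ours, 2), round(gertz, 2), round(ratio, 2))
```

Since the inputs are themselves rounded to two decimals, values recomputed this way may differ in the last digit from figures derived from unrounded counts.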
Tgfβ receptor-ligand predictions
Receptor | Gertz et al. (2003), supported | Gertz et al. (2003), unsupported | Iacucci et al., supported | Iacucci et al., unsupported |
---|---|---|---|---|
TGFBRII | Tgfb2 Tgfb3 Tgfb1 | -- | ||
BMPRIa | -- | Gdf5 | BMP8 BMP4 BMP2 BMP15 BMP3 BMP6 BMP10 | Tgfb2 Tgfb3 Tgfb1 INHBA INHA INBB INHC |
AMHR | -- | -- | -- | -- |
BMPRIB | Bmp2 Bmp4 | -- | BMP8 BMP4 BMP2 BMP15 BMP3 BMP6 BMP10 | Tgfb2 Tgfb3 Tgfb1 INHBA INHA INBB INHC |
ACTRIIa | ActivinBA | -- | BMP7 BMP5 BMP8 BMP4 BMP2 BMP15 BMP3 BMP6 BMP10 INHBA INHA INHBB INHBC | Tgfb3 Tgfb2 Tgfb1 |
ACTRIIb | ActivinBB | -- | BMP7 BMP5 BMP8 BMP4 BMP2 BMP15 BMP3 BMP6 BMP10 INHBA INHA INHBB INHBC | Tgfb2 Tgfb3 Tgfb1 |
SAX | -- | Bmp3 | -- | -- |
TKVR | -- | Bmp10 | -- | -- |
ACTRII | -- | DPP | -- | -- |
TGFBRI | -- | Bmp7 Bmp6 | -- | -- |
BMPRII | -- | Gdf8 | BMP8 BMP4 BMP2 BMP15 BMP3 BMP6 BMP10 | Tgfb2 Tgfb3 Tgfb1 INHBA INHA INBB INHC |
Our predictions for the chemokine family reconstruct over 65% of the supported edges (0.65 recall and 0.23 precision) of the receptor-ligand bipartite graph defined by the DLRP. By comparison, the Gertz et al. [6] method reconstructs 22% of the supported edges (0.22 recall and 0.37 precision). While the precision of the Gertz et al. [6] method is higher, the recall of our method is about three-fold higher. Comparing F-measures, our method improves slightly on that of Gertz et al. [6] (0.33 versus 0.27).
Qualitatively, our method also agrees with that of Gertz et al. [6]: the novel interaction of CCR1 with SCYA11 [18] reported in their work is also recovered by our method.
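Precision and recall in this setting are computed over edges of the bipartite graph. The sketch below uses invented receptor and ligand labels to show how a method can recover most known edges (high recall) while also predicting many unsupported ones (lower precision).

```python
def precision_recall(predicted, gold):
    """Edge-level precision and recall of predicted pairs against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented edges: 'supported' predictions are those present in the gold set.
gold = {("R1", "L1"), ("R1", "L2"), ("R2", "L3"), ("R3", "L1")}
predicted = {
    ("R1", "L1"), ("R1", "L2"), ("R2", "L3"),   # supported
    ("R2", "L1"), ("R3", "L2"), ("R3", "L3"),   # unsupported
}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```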
Schematic of receptor-ligand bipartite graph and performance measures. (a) Schematic of in-vivo receptor-ligand interaction at the cell membrane. (b) Bipartite graph schematic of the receptor-ligand interaction network. (c) Performance measures of tgfβ and chemokine bipartite graph construction for the Gertz et al. (2003) and Iacucci et al. methods.
The combined classifier performs better on the tgfβ family of receptors and ligands than on the chemokine family of receptors and ligands. This can be attributed to two reasons. Firstly, the tgfβ family has more positive examples to train with than the chemokine family. Secondly, the tgfβ family is more evolutionarily related while the chemokine family is related by function. Thus, it is more difficult to learn from the chemokine data, as it contains less evolutionarily related structure for the LS-SVM to exploit.
The combined kernel classifier method has several benefits. Foremost among them is the ability to predict multiple ligands for one receptor, which is an essential feature for receptor-ligand research. In addition, as the classifier output is continuous, the results can be treated as a prioritized list. This is a major convenience to researchers, as the set of candidate ligands is often large and the resources for validation are few. The major limitation of the method is the need for training examples for the receptor-ligand pairs one is trying to predict. This is particularly true for the chemokine dataset: when we consider only ligand candidates with two or more receptor pairings, the performance of our method improves (0.79 recall and 0.31 precision) (see Table 1).
The advantage of using the three sub-classifiers instead of a global classifier which combines all features is two-fold. The first reason is that the data sources used here are disparate and heterogeneous. A global classifier would require a mapping step which may introduce some noise. The second reason is that using separate sub-classifiers allows sub-classifiers to be removed and added. For example, if a better micro-array dataset becomes available in the future, it would be an advantage to be able to replace the existing expression-based kernel with one derived from the new dataset without having to retrain a global classifier. Also, if additional data sources become available, adding a sub-classifier based on the new data source would take less time than adding the data source and retraining the global classifier.
A practical advantage of using three sub-classifiers in our work became apparent when comparing the performance of the individual classifiers with that of the combined kernel classifier. More specifically, the combined kernel classifier performed as well as or better than any of the individual classifiers. In the case of the chemokine family, none of the three individual classifiers performed nearly as well as the combined kernel classifier. In the case of the tgfβ family, the expression classifier performed nearly as well as the combined kernel classifier (see Additional File 1, Table S1).
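Because the classifier output is continuous, prioritizing candidates amounts to sorting by score. The sketch below uses invented scores and a hypothetical helper, `rank_candidates`, to list the ligand candidates for one receptor from most to least promising.

```python
def rank_candidates(scores, receptor):
    """Return (ligand, score) tuples for one receptor, highest score first."""
    candidates = [(ligand, score)
                  for (rec, ligand), score in scores.items()
                  if rec == receptor]
    return sorted(candidates, key=lambda item: item[1], reverse=True)

# Invented combined classifier outputs for (receptor, ligand) pairs.
scores = {
    ("R1", "L1"): 0.42,
    ("R1", "L2"): -0.10,
    ("R1", "L3"): 0.87,
    ("R2", "L1"): 0.30,
}

ranking = rank_candidates(scores, "R1")
for ligand, score in ranking:
    print(ligand, score)
```

A researcher with limited validation resources could then test candidates from the top of the list downwards.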
Conclusions
The prediction of receptor-ligand pairings is a difficult and complex task. We have demonstrated that using multiple data sources provides an advantage over single data sources in solving this task. The use of multiple data sources allows us to extend our method as new data become available. Among our main contributions we count the ability of our method to prioritize candidate pairs, which is an essential feature for receptor-ligand research. As in-vivo validation is costly and time-consuming, it is important that researchers have a ranking of a potentially large number of candidates. In addition, we provide a method which has high recall (0.76 for the tgfβ family and 0.65 for the chemokine family) and improved F-measures when compared to Gertz et al. [6] (0.71 for Iacucci et al. vs 0.48 for Gertz et al. [6] when evaluating the tgfβ family and 0.33 for Iacucci et al. vs 0.27 for Gertz et al. [6] when evaluating the chemokine family). Thus, the method is reliable insofar as it retrieves a large portion of the true positives while not introducing too much noise. As more high-throughput data become available, we expect to extend the current methodology to accommodate them.
Declarations
Acknowledgements
The authors thank Dr. Shi Yu, Dr. Léon-Charles Tranchevent, and Dr. Anneleen Daemen for their thoughtful suggestions.
Funding: The authors would like to acknowledge support from:
*Research Council KUL: ProMeta, GOA Ambiorics, GOA MaNet, CoE EF/05/007 SymBioSys, START 1, several PhD/postdoc & fellow grants
*Flemish Government:
o FWO: PhD/postdoc grants, projects, G.0318.05 (subfunctionalization), G.0553.06 (VitamineD), G.0302.07 (SVM/Kernel), research communities (IC-CoS, ANMMM, MLDM); G.0733.09 (3UTR); G.082409 (EGFR)
o IWT: PhD Grants, Silicos; SBO-BioFrame, SBO-MoKa, TBM-IOTA3
o FOD:Cancer plans
* Belgian Federal Science Policy Office: IUAP P6/25 (Bio-MaGNet, Bioinformatics and Modeling: from Genomes to Networks, 2007-2011);
* EU-RTD: ERNSI: European Research Network on System Identification; FP7-HEALTH CHeartED
The scientific responsibility is assumed by its authors.
Authors’ Affiliations
References
- Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, Pazos F: TSEMA: interactive prediction of protein pairings between interacting families. Nucleic Acids Res 2006, 34: W315-W319. 10.1093/nar/gkl112
- Sato T, Yamanishi Y, Kanehisa M, Toh H: The inference of protein-protein interactions by co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships. Bioinformatics 2005, 21: 3482–3489. 10.1093/bioinformatics/bti564
- Bhardwaj N, Lu H: Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics 2005, 21: 2730–2738. 10.1093/bioinformatics/bti398
- Grigoriev A: A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res 2001, 29: 3513–3519. 10.1093/nar/29.17.3513
- Ge H, Liu Z, Church GM, Vidal M: Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet 2001, 29: 482–486. 10.1038/ng776
- Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, Cokus S, Rothschild B: Inferring protein interactions from phylogenetic distance matrices. Bioinformatics 2003, 19: 2039–2045. 10.1093/bioinformatics/btg278
- Kim S, Yoon J, Yang J, Park S: Walk-weighted subsequence kernels for protein-protein interaction extraction. BMC Bioinformatics 2010, 11: 107. 10.1186/1471-2105-11-107
- Miwa M, Saetre R, Miyao Y, Tsujii J: Protein-protein interaction extraction by leveraging multiple kernels and parsers. Int J Med Inform 2009, 78: e39-e46. 10.1016/j.ijmedinf.2009.04.010
- Suykens JA, Vandewalle J, De Moor B: Optimal control by least squares support vector machines. Neural Netw 2001, 14: 23–35. 10.1016/S0893-6080(00)00077-0
- Nagamine N, Sakakibara Y: Statistical prediction of protein chemical interactions based on chemical structure and mass spectrometry data. Bioinformatics 2007, 23: 2004–2012. 10.1093/bioinformatics/btm266
- Jacob L, Vert JP: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24: 2149–2156. 10.1093/bioinformatics/btn409
- Bleakley K, Yamanishi Y: Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics 2009, 25: 2397–2403. 10.1093/bioinformatics/btp433
- Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al.: Ensembl 2009. Nucleic Acids Res 2009, 37: D690-D697. 10.1093/nar/gkn828
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. 10.1093/nar/22.22.4673
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004, 101: 6062–6067. 10.1073/pnas.0400782101
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, et al.: InterPro: the integrative protein signature database. Nucleic Acids Res 2009, 37: D211-D215. 10.1093/nar/gkn785
- Graeber TG, Eisenberg D: Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat Genet 2001, 29: 295–300. 10.1038/ng755
- Gao JL, Sen AI, Kitaura M, Yoshie O, Rothenberg ME, Murphy PM, Luster AD: Identification of a mouse eosinophil receptor for the CC chemokine eotaxin. Biochem Biophys Res Commun 1996, 223: 679–684. 10.1006/bbrc.1996.0955
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.