PRIN: a predicted rice interactome network
© Gu et al; licensee BioMed Central Ltd. 2011
Received: 15 December 2010
Accepted: 16 May 2011
Published: 16 May 2011
Protein-protein interactions play a fundamental role in elucidating the molecular mechanisms of biomolecular function, signal transduction and metabolic pathways of living organisms. Although high-throughput technologies such as the yeast two-hybrid system and affinity purification followed by mass spectrometry are widely used in model organisms, progress in detecting protein-protein interactions in plants has been rather slow. With this motivation, our work presents a computational approach to predict protein-protein interactions in Oryza sativa.
To better understand the interactions of proteins in Oryza sativa, we have developed PRIN, a Predicted Rice Interactome Network. The protein-protein interaction data in PRIN are based on the interologs of six model organisms in which large-scale protein-protein interaction experiments have been carried out: yeast (Saccharomyces cerevisiae), worm (Caenorhabditis elegans), fruit fly (Drosophila melanogaster), human (Homo sapiens), Escherichia coli K12 and Arabidopsis thaliana. With certain quality controls, we obtained a total of 76,585 non-redundant rice protein interaction pairs among 5,049 rice proteins. Further analysis showed that the topological properties of the predicted rice protein interaction network are more similar to those of yeast than to those of the other five organisms. This may not be surprising, as the interologs based on yeast contribute nearly 74% of total interactions. In addition, GO annotation, subcellular localization information and gene expression data were mapped to our network for validation. Finally, a user-friendly web interface was developed to offer convenient database search and network visualization.
PRIN is the first well-annotated protein interaction database for the important model plant Oryza sativa. Using a computational approach, it greatly extends the currently available protein-protein interaction data for rice, and should provide further insights into rice functional genomics and systems biology.
PRIN is available online at http://bis.zju.edu.cn/prin/.
Proteins seldom perform their biological functions independently. Rather, they collaborate with other biological molecules such as nucleic acids and proteins to accomplish complex biological processes. Protein-protein interactions play fundamental roles in almost all biological processes, such as signal transduction, internal equilibrium maintenance and organ formation. Consequently, mapping genome-wide protein-protein interactions has been one of the key tasks of systems biology in understanding cellular processes.
High-throughput experimental methods, such as the yeast two-hybrid system (Y2H), affinity purification followed by mass spectrometry (AP-MS) and bimolecular fluorescence complementation (BiFC), have been widely employed at genome scale to construct protein-protein interaction networks of model organisms such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens and Escherichia coli K12 [3–11]. However, large-scale experiments are far from being in widespread use because they are costly and time-consuming. Computational approaches provide a rapid and convenient preliminary survey before comprehensive experimental studies of protein interactions begin. They also provide reasonable complements to existing experimental protein interaction networks. As experimental studies of protein interactions have deepened, and especially as large-scale, high-throughput experiments have generated more interaction data for model organisms, computational approaches to predicting protein-protein interactions in a particular species have become increasingly efficient and effective. Combined with literature mining of existing protein interactions, genomic information, protein structure and annotation information, bioinformatics plays an important role in developing methods for protein-protein interaction prediction, establishing high-quality protein-protein interaction databases, developing software and web servers for visualizing protein-protein interaction networks, and analysing protein interaction networks at genome scale [13–18].
Although laboratory-confirmed protein-protein interactions cover only a small part of the whole proteome, even in model organisms such as human and fruit fly [7–9, 11], these findings offer useful biological information for predicting novel protein-protein interactions in a particular species of interest. Machine learning methods such as Naïve Bayes and SVM have been used to extract biological information from gold-standard protein interaction data and to build classifiers for prediction. Such computational methods depend heavily on the reliability of the gold-standard data and are therefore limited in organisms that have little existing experimental data for training.
Genome information is another important source for protein-protein interaction prediction. Methods of this type usually use genome information (gene neighbourhood, gene fusion, domain fusion, gene co-expression, phylogenetic profile, subcellular co-location, domain interaction, GO similarity, etc.) to infer functional dependence between protein pairs. The gene neighbourhood method assumes that if the genes encoding two proteins are adjacent in different genomes, the proteins are functionally related and therefore more likely to interact with each other. The gene fusion method holds that if two functionally related proteins in one genome are homologous to parts of a single fused (composite) protein, these two proteins are likely to interact with each other. A phylogenetic profile describes the presence of homologous proteins across a series of species. When phylogenetic profiles are clustered, proteins with similar or identical profile patterns are considered to be functionally linked, and they are also more likely to interact with each other. Early methods usually considered each type of genomic information independently (for example, the domain-domain interaction methods used in SynechoNET), but more and more prediction methods combine several or all types of genome information to improve the precision of prediction (for example, the integrated methods used in AtPID [25, 26]).
Methods based on evolutionary information, such as correlated mutation, interologs [11, 27] and correlated evolutionary rate, have achieved dramatic improvements in cross-species protein-protein interaction prediction. The prediction of evolutionarily conserved protein-protein interactions rests on the evolutionary conservation of proteins across species, that is, on orthologs. The interolog method depends mainly on ortholog detection algorithms such as InParanoid [28–30]. Orthologous proteins are used to locate conserved protein-protein interactions among species. It has been shown that many pathways, such as GTPase signal transduction, are significantly conserved across species; in particular, pathway motifs (patterns that recur within pathways much more often than expected at random) appear in many different pathways. Over the years, these prediction methods have been successfully applied to human, yeast, fruit fly and other model species, with well-received results [32, 33]. The Online Predicted Human Interaction Database (OPHID) extracts evolutionarily conserved orthologous protein-protein interactions from Saccharomyces cerevisiae and Drosophila melanogaster with the interolog method, and combines them with literature mining data to construct the human protein interaction network. The most important trend in interolog methods is to take both orthologous information and genomic information into consideration to obtain high-quality protein-protein interaction networks, as in the approaches applied in MPID and AthPPI.
The complexity of plant materials presents a major obstacle to the analysis of protein-protein interactions in plant proteomics research. No plant interactome based on genome-scale experimental approaches has yet been constructed; only a few protein-protein interaction networks have been built to address particular biological questions. Arabidopsis thaliana is the only plant species for which a generally applicable interactome has been computationally constructed. There are several Arabidopsis thaliana protein-protein interaction databases built with different approaches: AtPIN and AtPID [25, 26] with an integration approach, AthPPI and the Predicted Arabidopsis Interactome with an interolog approach, and PAIR [39, 40] with a machine learning approach. However, a publicly available computational protein-protein interaction resource for the model monocotyledon Oryza sativa is still lacking.
Scarce experimental protein interaction data and a low level of genomic annotation are the two main barriers to the wide use of computational methods for Oryza sativa. Machine learning methods such as SVM and Naïve Bayes networks require both high-quality gold-standard experimental data and extensive genomic annotation. As a result, the interolog method combined with the limited rice genome information appears to be a feasible way to construct the first rice protein interaction network. In this study, we attempt to computationally depict a panorama of the rice interactome with the interolog method, using genome information such as GO annotations, subcellular localization information and gene expression data to validate the predicted protein interaction network and, at the same time, to extract significant biological network properties.
Datasets based on InParanoid7 ortholog pairs without a score cut-off were also generated in our study. In total, 1,144,911 protein interactions among 12,709 rice proteins were predicted (available for download from our website); on examination, these showed high redundancy and low sensitivity. As reported by Huang et al., the true positive rate of prediction falls significantly as the InParanoid score cut-off is lowered. We therefore excluded this large dataset from the reliable rice protein interactions and adopted a score cut-off of 1.0 as an internal quality control.
We introduced relative specificity similarity (RSS) scores [56–58] into PRIN to evaluate the GO correlation between two interacting proteins. RSS scores are mainly based on GO term similarity and GO depth (see Methods). Three independent RSSGO scores were calculated separately: RSSCC scores for GO cellular component terms, RSSBP scores for GO biological process terms and RSSMF scores for GO molecular function terms. As the RSSGO distribution in Figure 3C shows, the RSSCC scores of our network mainly fall within 0.6 to 1, RSSBP scores within 0.4 to 0.8, and RSSMF scores within 0.4 to 0.9. The high proportion of RSSMF scores between 0 and 0.1 is mainly due to the incompleteness of rice GO molecular function annotation. The proportion of pairs scoring 1.0 is very high for all three kinds of RSSGO score: 41.77% for RSSCC, 19.81% for RSSBP and 27.38% for RSSMF. This shows that paired proteins in PRIN are strongly correlated in their GO annotation, indicating a higher possibility that they interact with each other.
Subcellular localization is the specific location in the cell where a protein or gene product resides, such as the nucleus, cytoplasm or cell membrane. It plays an important role in understanding organelle function and compartmentalization. Proteins must occupy the same space in order to interact; hence interacting proteins tend to share the same subcellular localization, which is known as co-localization. Currently there is no dedicated subcellular localization database for rice, and the scarce and scattered rice subcellular localization information makes data collection difficult. To tackle this, we opted for the computational subcellular localization predictor RSLpred, which performs notably better than another predictor, Plant-PLoc, because it is specific to rice. By integrating the subcellular localizations of EBI and TIGR rice proteins as predicted by RSLpred, we finally obtained 14,308 interactions with subcellular localization information in our predicted interactome, of which 49.1% are co-localized. Four kinds of protein subcellular localization were catalogued: chloroplast, cytoplasm, mitochondria and nucleus. Nucleus-nucleus co-localization accounted for the largest share (nearly 44.3%) of all the protein interactions with subcellular localization information; this may not be surprising, because 64.2% of the proteins with subcellular localization annotations were predicted by RSLpred to be localized in the nucleus.
The web interface of PRIN was developed in Java (Struts, iBATIS and Spring frameworks) and is hosted on an Apache web server. The project uses MySQL 5.0 as its database management system and Cytoscape Web to visualize the protein-protein interaction network. We provide two ways to access our database: (i) a protein can be queried by its symbol name, Loc number or UniProt accession on the Protein Search page, and our server returns all the proteins predicted to interact with the submitted protein; (ii) users who have a list of proteins and want to know whether they interact with each other can paste the list into the submit box of the Interaction Search page, and our server returns all the interactions involving these proteins. PRIN provides both graphical and tabular results (containing PPI ID, protein ID, interolog species, co-localization, co-expression PCC score and RSS GO score) from which users can obtain protein and interaction information. More detailed information can be viewed by clicking a protein ID or interaction ID in the result table.
A small data set of experimentally determined rice protein-protein interactions, comprising 406 proteins and 430 interactions, was integrated from BIND (Additional File 3), IntAct and PlaPID. Although this experimental interactome covers only a small part of the whole rice interactome, 95 of its proteins are also found in our network; these constitute 230 interactions in our network and 66 interactions in the integrated experimental network. Among these 66 interactions, 20 have been determined by experiments, revealing reasonable sensitivity considering the rare and low-coverage experimental data.
The RSSGO score has been used as a reliable data training method in earlier protein interaction prediction studies [56–58]. In our study, the RSSGO score was used to inspect our predicted data. We calculated the RSSGO scores of the protein pairs in our network and counted the pairs with RSSCC, RSSBP and RSSMF scores above 0.5. We found that 78.9% of the pairs are highly co-annotated at an RSSGO score cut-off of 0.5. This indicates that protein pairs in PRIN are more likely to participate in the same biological process, exhibit similar molecular functions and belong to the same cellular structure, all of which point to a high possibility of protein interaction.
We mapped the gene expression data to the 430 experimental interactions and obtained PCC scores for 368 of them, of which 76 pairs are co-expressed. Although statistical significance is clearly lacking because experimental data are scarce, this suggests that interacting proteins tend to be co-expressed. Protein pairs in our network show a markedly high co-expression rate (28.3%) compared with random pairs drawn from the whole rice genome. We calculated the co-expression rate of all gene pairs represented on the microarray, and only 2.3% of them display a co-expression pattern at a Pearson correlation coefficient cut-off of 0.5. The co-expression quality of our predicted network is therefore notable.
[Table: The comparison of interactome topology between rice and model organisms]
The node degree of a protein in an interactome is the number of edges linked to it, with self-interactions also counted. Apart from self-interacting proteins, proteins in PRIN mainly have degrees of 2 to 10, as shown in Figure 5C. We constructed the hub catalog partly by reference to Jane's studies. As shown in Figure 5D, the node degree distribution of our network exhibits scale-free topology. Most proteins in a scale-free network have few interactions, while a few proteins, called distribution node proteins, have high connection degrees. A scale-free protein interaction network depends on these distribution node proteins and is highly tolerant of sudden environmental pressure.
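As a minimal sketch of how a node degree distribution of this kind can be tabulated, the Python below counts degrees from an edge list. The protein names and the helper functions `node_degrees` and `degree_distribution` are hypothetical, and counting a self-interaction as a single edge is an assumption; the text only states that self-interactions are counted.

```python
from collections import Counter


def node_degrees(edges):
    """Return {protein: degree} for an undirected interaction list.

    Assumption: a self-interaction (a, a) contributes one edge to a.
    """
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        if b != a:
            degree[b] += 1
    return degree


def degree_distribution(degree):
    """Return {k: number of proteins with degree k}, sorted by k."""
    return dict(sorted(Counter(degree.values()).items()))


# Hypothetical toy network; P5 interacts only with itself.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P5")]
toy_degree = node_degrees(toy_edges)
toy_distribution = degree_distribution(toy_degree)
print(toy_distribution)
```

In a scale-free network the resulting counts fall off roughly as a power law in k, so plotting them on log-log axes is a common first check.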
The shortest path length distribution, presented in Figure 5E, demonstrates the marked small-world property of our predicted rice interactome. The small-world property implies strong fault tolerance and stability in our network, as in real protein-protein interaction networks. When the expression of a few proteins is suppressed under environmental pressure, biological pathways are not interrupted but can be completed by alternative proteins. In such small-world networks, information is transmitted very quickly, in keeping with rapid changes in environmental pressure. The stress centrality of a protein counts how many shortest paths pass through it. A protein traversed by a large number of shortest paths experiences higher stress, suggesting that it has a more important biological function. The stress centrality distribution is shown in Figure 5F.
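To illustrate the shortest path length distribution, the following sketch runs a breadth-first search from every protein in an unweighted, undirected network and counts each unordered pair once. The helper names and the toy chain are hypothetical, and this is not a reproduction of the tool used to produce Figure 5E; stress centrality is not computed here.

```python
from collections import Counter, deque


def build_adjacency(edges):
    """Build {protein: set(neighbours)}; self-interactions add no neighbour."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def bfs_distances(adj, source):
    """Return {protein: shortest path length from source} for reachable proteins."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_length_distribution(adj):
    """Return {length: number of unordered protein pairs at that distance}."""
    counts = Counter()
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if s < t:  # count each unordered pair once
                counts[d] += 1
    return dict(counts)


# Hypothetical toy chain A-B-C-D.
chain = build_adjacency([("A", "B"), ("B", "C"), ("C", "D")])
chain_lengths = path_length_distribution(chain)
print(chain_lengths)
```

A small-world network shows most pairs concentrated at short lengths even when the network is large.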
The clustering coefficient of a protein is the ratio N/M, where N is the number of edges between the neighbours of the protein and M is the maximum number of edges that could possibly exist between those neighbours. It is calculated as $C_n = 2e_n / (k_n(k_n - 1))$, where $k_n$ is the number of neighbours of protein n and $e_n$ is the number of connected pairs among all neighbours of n. The average clustering coefficient of proteins at each degree is shown in Figure 5G. The high clustering coefficient values suggest that protein-protein interaction in rice is highly modular, and that cellular function in the real PPI network is likely to be implemented in a highly modular way. Research on metabolic networks using the average clustering coefficient distribution has shown a similar modular tendency. The clustering coefficient is therefore a very useful measure for identifying functional modules in the rice protein-protein interaction network.
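The clustering coefficient formula can be checked on a small example. The sketch below computes 2e/(k(k-1)) for each protein and averages it per degree; the function names and toy network are hypothetical, and returning 0 for proteins with fewer than two neighbours is an assumed convention rather than something stated in the text.

```python
def build_adjacency(edges):
    """Build {protein: set(neighbours)}; self-interactions add no neighbour."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set())
        adj.setdefault(b, set())
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def clustering_coefficient(adj, n):
    """Return 2e/(k(k-1)) for protein n (0.0 if n has fewer than two neighbours)."""
    neighbours = sorted(adj[n])
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention for proteins with under two neighbours
    # e = number of connected pairs among the neighbours of n
    e = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbours[j] in adj[neighbours[i]]
    )
    return 2.0 * e / (k * (k - 1))


def mean_coefficient_by_degree(adj):
    """Return {degree: mean clustering coefficient of proteins with that degree}."""
    totals = {}
    for n in adj:
        k = len(adj[n])
        total, count = totals.get(k, (0.0, 0))
        totals[k] = (total + clustering_coefficient(adj, n), count + 1)
    return {k: total / count for k, (total, count) in sorted(totals.items())}


# Hypothetical toy network: A has neighbours B, C, D, and only B-C are linked.
toy = build_adjacency([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
print(clustering_coefficient(toy, "A"))  # one connected pair out of three possible
print(mean_coefficient_by_degree(toy))
```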
The neighbourhood connectivity of a protein, as defined by NetworkAnalyzer, is the average number of interactions of all neighbours of this protein. As shown in Figure 5H, the neighbourhood connectivity of PRIN first increases and then decreases with degree. This indicates that low-degree proteins (<30) tend to interact with other low-degree proteins, whereas in the high-degree range (>30) it is more common for high-degree proteins to interact with low-degree proteins.
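A minimal sketch of the neighbourhood connectivity measure, assuming an adjacency map of the form {protein: set of neighbours}. The function names and the star-shaped toy network are hypothetical; the example only illustrates the case in which a high-degree protein is surrounded by low-degree partners.

```python
def neighbourhood_connectivity(adj):
    """Return {protein: average degree of its neighbours}; isolated proteins are skipped."""
    return {
        n: sum(len(adj[m]) for m in neighbours) / len(neighbours)
        for n, neighbours in adj.items()
        if neighbours
    }


def mean_connectivity_by_degree(adj):
    """Return {degree k: mean neighbourhood connectivity of proteins with degree k}."""
    connectivity = neighbourhood_connectivity(adj)
    grouped = {}
    for n, value in connectivity.items():
        grouped.setdefault(len(adj[n]), []).append(value)
    return {k: sum(values) / len(values) for k, values in sorted(grouped.items())}


# Hypothetical star network: hub H linked to three proteins of degree 1.
star = {"H": {"L1", "L2", "L3"}, "L1": {"H"}, "L2": {"H"}, "L3": {"H"}}
star_connectivity = neighbourhood_connectivity(star)
star_by_degree = mean_connectivity_by_degree(star)
print(star_by_degree)
```

Plotting the per-degree averages against k gives a curve comparable in form to the one described for Figure 5H.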
PRIN is based on an established computational method known as interologs, combined with the genomic features of rice. Several internal quality controls were applied in constructing the network: integration of a large amount of model organism protein-protein interaction data, manual proofreading of mismatched IDs during database integration, restriction of orthologous data to the top InParanoid score and manual verification of the resulting network. Genomic features of rice, such as GO annotations, subcellular localization and gene expression data, were mapped to PRIN in order to validate our network and to obtain biologically significant results. We finally acquired 76,585 interactions among 5,049 proteins (Additional File 5). In comparisons with the small experimental interactome and a random interactome, PRIN shows a satisfactory tendency towards GO co-annotation, co-localization and co-expression, making it reliable for prospective studies in rice functional biology and systems biology. A well-organized web interface has been developed for network visualization and database search, and will be updated weekly as new interologs are detected. It is publicly available at http://bis.zju.edu.cn/prin/. Through the interolog process we have found many basic metabolic pathways conserved among species and, most excitingly, new protein complexes that join known pathways. Pathway expansion, metabolic module detection and protein complex functional annotation based on PRIN will be the most important directions for our further comprehensive determination of genomic function.
Our prediction is based on an existing method known as the interolog method. The interolog method rests on a simple logical principle: if 'protein A' and 'protein B' in a specific species are orthologous to 'protein A1' and 'protein B1' respectively in another species, and the interaction between 'protein A1' and 'protein B1' has already been verified experimentally in the reference species, then 'protein A' and 'protein B' are predicted to interact with each other. If an interolog of protein A and protein B is found in more than one species, the reliability of their interaction is increased. Because the multi-species interolog method takes into account the evolutionary conservation of interacting protein pairs, it naturally has better sensitivity in cross-species prediction.
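The interolog principle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used to build PRIN: the function `predict_interologs`, the data layout and all protein identifiers are hypothetical, and the ortholog map is assumed to have been filtered already.

```python
from collections import defaultdict


def predict_interologs(orthologs, interactions):
    """Transfer interactions from reference species to the target species.

    orthologs:    {species: {reference_protein: set of target proteins}}
    interactions: {species: iterable of (reference_a, reference_b) pairs}
    Returns {frozenset({target_a, target_b}): set of supporting species}.
    """
    predicted = defaultdict(set)
    for species, pairs in interactions.items():
        ortholog_map = orthologs.get(species, {})
        for a1, b1 in pairs:
            # Both partners must have an ortholog in the target species.
            for a in ortholog_map.get(a1, ()):
                for b in ortholog_map.get(b1, ()):
                    predicted[frozenset((a, b))].add(species)
    return dict(predicted)


# Hypothetical identifiers: Y* yeast, H* human, R* rice.
toy_orthologs = {
    "yeast": {"Y1": {"R1"}, "Y2": {"R2"}},
    "human": {"H1": {"R1"}, "H2": {"R2"}, "H3": {"R3"}},
}
toy_interactions = {
    "yeast": [("Y1", "Y2")],
    "human": [("H1", "H2"), ("H2", "H3"), ("H4", "H1")],  # H4 has no rice ortholog
}
toy_predicted = predict_interologs(toy_orthologs, toy_interactions)
for pair, species in toy_predicted.items():
    print(sorted(pair), sorted(species))
```

Storing each pair as a `frozenset` makes the prediction non-redundant across species, and the size of the supporting species set gives a simple indication of how many reference organisms back a predicted pair.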
The six model organism interactomes were integrated with our own integration methods. Protein-protein interaction data for the six model organisms, determined by high-throughput experiments or derived from the literature, were obtained from the public protein-protein interaction databases BioGrid, IntAct, DIP and MINT. Species-specific protein-protein interaction databases such as HPRD for human, MIPS for yeast and TAIR for Arabidopsis were also used, providing a significant amount of high-quality protein-protein interaction data. An ID dictionary was created to provide cross-database ID mapping, based on Biomart, the PIR ID mapping service, the Uniprot ID mapping service, documents from Swissprot and scripted extraction from Uniprot XML files. ID mismatches and multiple matches were manually corrected during integration, and IDs from old versions were merged into the new version or deleted. The ortholog data between rice and each of the six model organisms were obtained independently from the InParanoid database. InParanoid compares all protein sequence pairs with its own algorithm, which is based on BLAST calculations but not simply on the best BLAST score. The protein with the highest similarity is selected as the candidate protein, ensuring that no other protein has a higher similarity to its candidate protein. All orthologous proteins in two species were obtained through these screening methods [28–30]. Only the top pairs clustered by InParanoid with a score cut-off of 1.0 were selected, which exerts some control over the false positive rate of the data. Although some low-scoring orthologs produce correct interactions, they are more likely to yield false positives. The ortholog data were then mapped to the integrated interactome to give the interologs.
We finally predicted 76,585 rice protein-protein interactions among 5,049 proteins, with 2,363 interactions from Arabidopsis thaliana, 59,915 interactions from Saccharomyces cerevisiae, 5,815 interactions from Escherichia coli K12, 1,422 interactions from Caenorhabditis elegans, 3,070 interactions from Drosophila melanogaster and 8,157 interactions from Homo sapiens.
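A short arithmetic check of the per-species counts is sketched below. The counts are taken from the text; the interpretation in the comments (that the surplus over 76,585 arises because a pair can be supported by more than one species) is an assumption based on the multi-species interolog principle, not a statement from the source.

```python
# Per-species interolog counts as reported in the text.
species_counts = {
    "Arabidopsis thaliana": 2363,
    "Saccharomyces cerevisiae": 59915,
    "Escherichia coli K12": 5815,
    "Caenorhabditis elegans": 1422,
    "Drosophila melanogaster": 3070,
    "Homo sapiens": 8157,
}
non_redundant_total = 76585  # non-redundant rice interaction pairs in PRIN

summed_total = sum(species_counts.values())
# Assumed reading: the surplus reflects pairs counted under more than one species.
surplus = summed_total - non_redundant_total
yeast_share = species_counts["Saccharomyces cerevisiae"] / summed_total

print(summed_total, surplus, round(100 * yeast_share, 1))
```

Under this reading, the yeast share of the summed counts is consistent with the figure of nearly 74% quoted for the yeast contribution.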
Three independent RSSGO scores were given. With a cut-off of 0.8, a larger RSSBP score indicates that two proteins have a stronger correlation in biological processes; a larger RSSCC score indicates that two proteins have more similar cellular component characteristics; and a larger RSSMF score indicates that two proteins are more similar in molecular function.
Rice subcellular localization data were obtained from the predictions of RSLpred. RSLpred is an integrated prediction server for rice subcellular localization based on four kinds of SVM modules: amino acid composition, dipeptide composition, pseudo amino acid (pseAA) composition and evolutionary information from PSI-Blast. RSLpred classifies rice proteins into four subcellular locations: chloroplast, cytoplasm, mitochondria and nucleus. The complete rice proteomes of EBI and TIGR have been predicted by RSLpred with the faster, traditional amino acid composition based module; these two files were downloaded and combined to obtain the maximum coverage of our protein interaction network. Considering the transport mechanisms of proteins, we did not adopt the winner-takes-all approach used in earlier studies, and all of the subcellular localizations predicted by RSLpred for a single protein were used for annotation. If any one of the localizations of a multi-localized protein is the same as that of its interaction partner, the two proteins are considered co-localized.
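The co-localization rule described above (no winner-takes-all; any shared compartment counts) can be sketched as a set intersection. The function names, protein identifiers and localization assignments below are hypothetical, and restricting the rate to pairs in which both proteins are annotated is an assumption.

```python
def is_colocalized(locations_a, locations_b):
    """Two proteins are co-localized if they share at least one predicted location."""
    return bool(set(locations_a) & set(locations_b))


def colocalization_rate(interactions, localization):
    """Return (annotated pair count, co-localized fraction).

    Assumption: only pairs in which both proteins have a localization are counted.
    """
    annotated = [
        (a, b)
        for a, b in interactions
        if localization.get(a) and localization.get(b)
    ]
    if not annotated:
        return 0, 0.0
    hits = sum(is_colocalized(localization[a], localization[b]) for a, b in annotated)
    return len(annotated), hits / len(annotated)


# Hypothetical predictions; R1 is multi-localized and R4 has no annotation.
toy_localization = {
    "R1": {"nucleus", "cytoplasm"},
    "R2": {"nucleus"},
    "R3": {"chloroplast"},
}
toy_pairs = [("R1", "R2"), ("R1", "R3"), ("R2", "R4")]
toy_annotated, toy_rate = colocalization_rate(toy_pairs, toy_localization)
print(toy_annotated, toy_rate)
```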
For the Pearson correlation coefficient ɣ between the expression profiles of genes A and B, x̄ and ȳ denote the mean gene expression levels over the m time points, and σx and σy denote the corresponding standard deviations. The value of ɣ ranges from -1 to 1: -1 means that the expression patterns of A and B are opposite (when one increases, the other decreases); 1 means that the expression patterns of A and B are consistent (when one increases, the other also increases); and 0 means that the expression patterns of A and B are unrelated. Since interacting proteins may reinforce each other (corresponding to ɣ > 0) or inhibit each other (corresponding to ɣ < 0), we use the absolute value of ɣ as the co-expression measure for a predicted protein interaction.
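As a hedged illustration of the co-expression measure, the sketch below computes the standard Pearson correlation coefficient for two expression profiles and applies the 0.5 cut-off to its absolute value. The function names and toy profiles are hypothetical; treating a flat profile as uncorrelated and using "greater than or equal to" at the cut-off are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = sum((xi - mean_x) ** 2 for xi in x)
    spread_y = sum((yi - mean_y) ** 2 for yi in y)
    if spread_x == 0 or spread_y == 0:
        return 0.0  # assumed convention: a flat profile is treated as uncorrelated
    return covariance / math.sqrt(spread_x * spread_y)


def is_coexpressed(x, y, cutoff=0.5):
    """Apply the cut-off to |PCC| so that opposite patterns also count."""
    return abs(pearson(x, y)) >= cutoff


# Hypothetical expression profiles over four time points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [8.0, 6.0, 4.0, 2.0]   # moves opposite to gene_a
gene_c = [2.0, 1.0, 1.0, 2.0]   # unrelated to gene_a

print(pearson(gene_a, gene_b), is_coexpressed(gene_a, gene_b))
print(pearson(gene_a, gene_c), is_coexpressed(gene_a, gene_c))
```

Taking the absolute value means that a strongly anti-correlated pair such as `gene_a` and `gene_b` is still reported as co-expressed.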
We thank Peijian Cao, Fei He, Xiao Li, Kui Lin, Donglin Huang and Christian Klukas for their generous help. This work was supported by the NSFC [30971743, 31050110121], the MoST [2009DFA32030], and the Program for New Century Excellent Talents in University of China [NCET-07-0740].
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.