A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network
- Zhu-Hong You^{1, 2, 3},
- Zheng Yin^{3},
- Kyungsook Han^{4},
- De-Shuang Huang^{1} and
- Xiaobo Zhou^{3}
DOI: 10.1186/1471-2105-11-343
© You et al; licensee BioMed Central Ltd. 2010
Received: 24 December 2009
Accepted: 24 June 2010
Published: 24 June 2010
Abstract
Background
Genetic interaction profiles are highly informative for understanding the functional linkages between genes, and have therefore been extensively exploited for annotating gene functions and dissecting specific pathway structures. However, our understanding of the relationship between double concurrent perturbation and various higher-level phenotypic changes, e.g. those in cells, tissues or organs, is rather limited. Modifier screens, such as synthetic genetic arrays (SGA), can help us to understand the phenotype caused by combined gene mutations. Unfortunately, exhaustive tests on all possible combined mutations in any genome are vulnerable to combinatorial explosion and are infeasible either technically or financially. Therefore, an accurate computational approach to predict genetic interaction is highly desirable, and such methods have the potential to alleviate the bottleneck in experiment design.
Results
In this work, we introduce a computational systems biology approach for the accurate prediction of pairwise synthetic genetic interactions (SGI). First, a high-coverage and high-precision functional gene network (FGN) is constructed by integrating protein-protein interaction (PPI), protein complex and gene expression data; then, a graph-based semi-supervised learning (SSL) classifier is utilized to identify SGI, where the topological properties of protein pairs in the weighted FGN are used as input features of the classifier. We compare the proposed SSL method with a state-of-the-art supervised classifier, the support vector machine (SVM), on a benchmark dataset in S. cerevisiae to validate our method's ability to distinguish synthetic genetic interactions from non-interacting gene pairs. Experimental results show that the proposed method can accurately predict genetic interactions in S. cerevisiae (with a sensitivity of 92% and a specificity of 91%). Notably, the SSL method is more efficient than SVM, especially for very small training sets and large test sets.
Conclusions
We developed a graph-based SSL classifier for predicting SGI. The classifier employs topological properties of the weighted FGN as input features and simultaneously employs information induced from labelled and unlabelled data. Our analysis indicates that the topological properties of the weighted FGN can be employed to accurately predict SGI. Also, the graph-based SSL method outperforms the traditional standard supervised approach, especially when used with small training sets. The proposed method can alleviate the experimental burden of exhaustive testing and provide a useful guide for biologists in narrowing down the candidate gene pairs with SGI. The data and source code implementing the method are available from the website: http://home.ustc.edu.cn/~yzh33108/GeneticInterPred.htm
Background
Genetic interaction analysis, in which two mutations have a combined effect not exhibited by either mutation alone, can reveal functional relationships between genes and pathways, and has thus been used extensively to shed light on pathway organization in model organisms [1, 2]. For example, proteins in the same pathway tend to share similar synthetic lethal partners [3]. Given a pair of genes, the number of common genetic interaction partners of these two genes can be used to calculate the probability that they interact physically or share a biological function. Therefore, identifying gene pairs that participate in synthetic genetic interaction (SGI) is very important for understanding cellular interaction and determining functional relationships between genes. Usually, SGI includes synthetic lethal (SL, where simultaneous mutation, usually deletion, of both genes causes lethality while mutation of either gene alone does not) and synthetic sick (SS, where simultaneous mutation of two genes causes growth retardation) interactions. However, so far little is known about how genes interact to produce more complicated phenotypes such as morphological variations.
Recently, modifier screening such as synthetic genetic arrays (SGA) has been applied to experimentally test the phenotypes of double concurrent perturbations and identify whether gene pairs have SGI [3]. Although high-throughput SGA technology has enabled systematic construction of double concurrent perturbations in many organisms, it remains difficult and expensive to experimentally map out pairwise genetic interactions for genome-wide analysis in any single organism. For example, the genome of S. cerevisiae includes about 6,275 genes. About 18 million double mutants need to be tested if all pairwise combinations are to be covered. This number expands to about 200 million for the simple metazoan C. elegans (with ~20,000 genes), posing insurmountable technical and financial obstacles.
Therefore, many computational methods for predicting SGI have been proposed in previous works in order to alleviate the experimental bottleneck [4, 5]. A promising solution is to predict SGI by integrating various types of available proteomics and genomics data. Candidate gene pairs with SGI are computationally predicted and then validated experimentally. In [4], SS or SL gene pairs in S. cerevisiae are successfully predicted, with 80% of the interactions being discovered by testing 20% of all possible combinations of gene pairs. Various supervised algorithms, such as artificial neural networks, SVMs and decision trees, have been developed to tackle the synthetic genetic interaction prediction problem [4, 6]. Although they can handle large input spaces and deal with noisy samples in an efficient and robust way, a main difficulty facing all supervised methods is that they predict SGI only from labelled samples, so the learning process relies heavily on the quality of the training dataset [7]. For example, in [4] about 519,647 experimentally tested gene pairs of S. cerevisiae are adopted as the training dataset, a size that is unattainable in most cases.
Usually, obtaining labelled samples is much more difficult than obtaining unlabelled samples. When the available training set is small, traditional approaches based on supervised learning may fail. Worse still, experimentally supported genetic interactions are far fewer in metazoans than in S. cerevisiae, so it is more difficult to generate genome-wide predictions in metazoans by using supervised algorithms. Therefore, it is desirable to develop a predictive learning algorithm that uses both labelled and unlabelled samples. In this context, semi-supervised classifiers are a natural choice. An SSL classifier uses the available label information as well as the wealth of unlabelled data as input. We apply a graph-based SSL method, previously presented in [8], in the context of SGI prediction. One advantage of SSL is its compatibility with small training sets; thus it could have great potential in organisms, especially metazoans, with fewer experimentally supported genetic interaction gene pairs. We concentrate on graph-based methods because of their solid mathematical background, as well as their close relationship with kernel methods and model visualization.
In recent years, combining information from diverse genomic or proteomic evidence in order to arrive at an accurate and holistic network has become an active research topic [9–13]. The heterogeneous data sources, in one way or another, carry interaction information reflecting different aspects of gene associations and their functional relationships. Therefore, one of the major challenges is to integrate these data sources and obtain a system-level view of the functional relationships between genes [14]. Successful applications have shown that an integration of heterogeneous types of high-throughput biological data can improve the accuracy of the groupings compared with any single dataset alone [10, 15–19]. However, despite the success of integrated networks in other areas [10], most previous works on synthetic genetic interaction prediction mainly consider PPI or gene expression data alone [20–22].
In this study, we integrate PPI, protein complex and gene expression data simultaneously to exploit more information and thereby improve the accuracy of genetic interaction prediction, taking the following observations into consideration. PPI data are believed to contain valuable insight into the inner workings of cells. Therefore, they may provide useful clues about the function of individual proteins or signalling pathways [23, 24]. Although it is unclear which proteins are in physical contact, protein complexes comprise groups of proteins that perform a certain cellular task together, and they contain rich information about functional relationships among the involved proteins. High-throughput gene expression profiles are becoming essential resources for a systems-level understanding of genetic interaction [25–28]. Gene expression profiles measure the expression levels of genes at genome scale. Relative to randomly paired genes, functionally interacting genes are more likely to have similar expression patterns and phenotypes [5, 29, 30]. It is assumed that genes with similar expression profiles are controlled by the same transcription factors and thus are functionally associated [25, 31].
Network analysis is a quantitative method originating from the social sciences; it studies the topological properties of nodes related to their connectivity and position in the network. It has become increasingly popular in diverse areas, especially in molecular biology and computational biology [9, 32]. Network analysis is a powerful tool for studying the relationships between two nodes in a network. Recent work has shown that genetic interactions are more likely to be found among proteins that are highly connected and highly central in the protein interaction network [33]. This finding demonstrates the correlation between the topological properties of the PPI network and SGI between proteins. In this study, we examine the extent to which pairwise SGI can be predicted from the topological properties of the corresponding proteins in an FGN.
Previous works consider only the topological properties of the binary protein interaction network, while ignoring the underlying functional relationships that can be reflected by the gene expression profile [4, 20]. A major limitation of these methods stems from the fact that the weights of ties are not taken into account. For an FGN, the weights often reflect the functional similarity represented by the ties. Exploring the information that weights hold allows us to further our understanding of networks [34, 35]. In this paper, we also present a straightforward generalization of a number of weighted network properties that were originally defined on unweighted networks. Concretely, the weighted network properties are defined by combining weighted and topological observables that enable us to characterize the complex statistical properties and heterogeneity of the actual weights of edges and nodes. This information allows us to investigate the correlations among weighted quantities and the underlying topological structure of the network. The topological properties of the FGN are examined with the aim of discovering the relationship between the network properties of gene pairs and the existence of an SGI relationship.
Results
General approach
We can see from Fig. 1 that PPI data, protein complex data, and gene expression profiles are integrated to build a high-coverage and high-precision weighted FGN. More specifically, PPI and protein complex data are used to determine the topology of the network. Then a clustering analysis method is utilized to identify functionally related groups from the gene expression profile, and the weights of the interactions are calculated based on the gene expression profile and the clustering centroids, i.e. the weight of an interaction derives from a metric that considers the distance between the expression profile of each gene and the centroid of its cluster, as well as the distance between the two cluster centroids themselves. The weights are assigned as confidence scores, which represent the functional coupling of the linked genes. Considering weights of interactions instead of binary linkage information allows more accurate modelling and yields better classification performance [15, 17].
Then, a set of topological properties is extracted from the FGN. These network properties, together with the experimentally obtained gene pairs that have been confirmed to have or not to have a synthetic genetic interaction, are used as the input of an SSL classifier to predict other, unknown interacting gene pairs. Concretely, we use an SSL classifier to model correlations between network properties and the existence of an SGI. The output labels of the SSL classifier are soft labels y_{i} ∈ [0, 1], which measure whether the two corresponding genes participate in an SGI. The details of the above procedure are described in the Methods section.
Cross validation
Performance comparisons are based on the following cross-validation (CV) procedure. CV is a way of choosing proper benchmarking samples to assess the accuracy and validity of a statistical model. Specifically, we randomly select 1,500 known SGI pairs and 1,500 non-SGI pairs from the dataset provided by Tong et al. [3]. Thus, the sampled dataset contains an equal number of SGI and non-SGI gene pairs. In n-fold CV, we randomly divide the known SGI pairs into n subsets of approximately equal size. An equal number of non-SGI pairs is randomly selected and assigned to each of the n subsets. Then n - 1 such subsets are combined for training the classifier, which is subsequently tested on the SGI and non-SGI pairs from the withheld subset. This procedure is repeated n times, with each subset playing the role of the test subset once.
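The fold construction described above can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names `balanced_folds` and `cv_splits`, the fixed random seed, and the toy pair identifiers are assumptions introduced here.

```python
import random


def balanced_folds(pos_pairs, neg_pairs, n_folds, seed=0):
    """Split positive and negative pairs into n folds with equal class balance."""
    rng = random.Random(seed)
    pos = list(pos_pairs)
    neg = list(neg_pairs)
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = []
    for k in range(n_folds):
        # Every n-th item goes to fold k, so fold sizes differ by at most one.
        fold = [(p, 1) for p in pos[k::n_folds]] + [(q, 0) for q in neg[k::n_folds]]
        folds.append(fold)
    return folds


def cv_splits(folds):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    for k, test in enumerate(folds):
        train = [item for j, fold in enumerate(folds) if j != k for item in fold]
        yield train, test


# Toy stand-ins for the 1,500 SGI and 1,500 non-SGI gene pairs.
sgi = [("sgi", i) for i in range(1500)]
non_sgi = [("non", i) for i in range(1500)]
folds = balanced_folds(sgi, non_sgi, n_folds=5)
splits = list(cv_splits(folds))
```

With n = 5, each fold holds 300 SGI and 300 non-SGI pairs, so each training set contains 2,400 pairs and each test set 600.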
We use the standard receiver operating characteristic (ROC) curve to assess overall performance. We compute the sensitivity (or true-positive rate, defined here as the fraction of SGI gene pairs correctly predicted) and the false-positive rate (defined here as the fraction of non-SGI gene pairs incorrectly predicted to be SGI) at decreasing stringency levels of the classifier, which outputs soft labels. By using alternative score thresholds, this approach can be tuned to predict a subset of SGI with higher confidence at a small cost in sensitivity.
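As an illustration of how the ROC points follow from soft labels, the sketch below lowers the score threshold step by step and records the false-positive and true-positive rates at each step. The function names `roc_points` and `auc` and the six toy scores are assumptions made for this example; they are not taken from the released source code.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all pairs tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy soft labels (1 = SGI, 0 = non-SGI); one non-SGI pair outranks one SGI pair.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

In this toy case eight of the nine SGI/non-SGI score comparisons are ordered correctly, so the area under the curve is 8/9.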
Experiment results
SVM has emerged as one of the most popular supervised approaches, with a wide range of applications. In particular, previous studies have demonstrated that SVM has better learning performance and accuracy than other supervised algorithms, such as artificial neural networks and decision trees [36]. Therefore, in this study we implemented our graph-based SSL algorithm and compared it with SVM in distinguishing SGI from non-SGI gene pairs on the same benchmark dataset. We test the capability of our method using different levels of sparsity of the training set. In the experiments, 80% (5-fold CV), 50% (2-fold CV), and 20% of the known SGI and non-SGI gene pairs are randomly chosen for training the classifier, which is subsequently tested on all other SGI and non-SGI gene pairs from the withheld group (this is repeated several times, with each group playing the role of the test group at least once). Since the gene pairs to be classified for cross-validation are randomly chosen, we repeated each experiment five times and computed the average of all the results.
Discussion
The statistics of network properties for SGI vs. non-SGI gene pairs
Pair statistic | Network property | KS statistic (binary network) | KS statistic (weighted network) | P-value (binary network) | P-value (weighted network)
---|---|---|---|---|---
Average | Degree | 0.0364 | 0.0261 | 0.9 | 0.0011
Average | Closeness | 0.0212 | 0.0385 | 0.0108 | 1.48E-07
Average | Betweenness | 0.0319 | 0.0529 | 1.55E-05 | 7.05E-14
Average | Clustering coefficient | 0.0679 | 0.0691 | 1.41E-23 | 2.365E-24
Absolute difference | Degree | 0.0587 | 0.0715 | 1.00E-17 | 5.71E-25
Absolute difference | Closeness | 0.0587 | 0.0441 | 1.00E-17 | 9.365E-10
Absolute difference | Betweenness | 0.0313 | 0.0414 | 2.35E-05 | 1.56E-08
Absolute difference | Clustering coefficient | 0.0615 | 0.0571 | 1.97E-19 | 4.435E-26
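Assuming the KS values in the table above are two-sample Kolmogorov-Smirnov statistics comparing the distribution of a feature over SGI pairs with its distribution over non-SGI pairs, the statistic itself is the largest gap between the two empirical cumulative distribution functions. The sketch below computes only that gap (not the p-value); the function name `ks_statistic` and the toy samples are assumptions for illustration.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    gap = 0.0
    for x in a + b:
        # bisect_right counts the values <= x, i.e. the empirical CDF at x.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Toy feature values for SGI and non-SGI gene pairs.
sgi_values = [1, 2, 3, 4]
non_sgi_values = [3, 4, 5, 6]
ks = ks_statistic(sgi_values, non_sgi_values)
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means they do not overlap at all.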
Conclusions
In conclusion, an SSL prediction approach was proposed in this paper to predict SGI by combining functional and topological properties of an FGN. Using a clustering-based data integration method, large-scale protein interaction data, protein complex data and multiple time-course gene expression datasets were combined in order to build an FGN in yeast. Greater coverage and higher accuracy were achieved in comparison with previous high-throughput studies of PPI networks in yeast. Then, we showed that topological properties of protein pairs in an FGN can serve as compelling and relatively robust determinants of the existence of a synthetic genetic interaction between them. Finally, a graph-based SSL method is utilized as a classifier to model correlations between FGN properties and the existence of a synthetic genetic interaction.
Our results demonstrate that the proposed algorithm can achieve better performance compared with previous methods. Our framework of feature representation is general, and it is straightforward to add other topological properties that are relevant to this problem. It is also possible to add other types of biological evidence. For example, information about the function of proteins can be encoded in our framework as well. We hope to extend this work and improve the feature representation in future so that we can detect other types of interaction groups.
Methods
Biological datasets
Four different types of data sets are used in this study: 1) a gold standard dataset of known genetic interactions (true positives, TPs) and non-interacting protein pairs (true negatives, TNs); 2) experimental protein-protein interaction data; 3) experimental protein complex data; 4) time-lapse gene expression profiles.
Gold standard genetic interaction dataset
Using the Synthetic Genetic Array (SGA) technology, Tong et al. screened 132 query strains (carrying mutations in genes with diverse functions in cell polarity, cell wall biosynthesis, chromosome segregation, and DNA synthesis and repair) against the complete library of ~4,700 viable haploid deletion strains. In total, ~650,000 gene pairs were experimentally tested, and ~4,000 synthetic lethal or synthetic sick interactions were identified, a frequency of 0.65% [3]. We used this dataset as the gold standard for investigating synthetic genetic interactions in S. cerevisiae.
Protein-protein interaction dataset
To compute network properties associated with protein-protein interactions in S. cerevisiae, we downloaded protein interaction data from the BioGrid database [37]. This network contains 12,990 unique interactions among 4,478 proteins.
Protein complexes dataset
For each protein complex, we assigned binary interactions between any two proteins participating in the complex. Thus, in general, if there are n proteins in a protein complex, we add n(n - 1)/2 binary interactions. We obtained the protein complex data from [38, 39]. Altogether, about 49,000 interactions are added to the protein interaction network.
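The expansion of a complex into pairwise links can be sketched as below. The function names and the protein identifiers are hypothetical and serve only to illustrate the n(n - 1)/2 rule; pairs are stored as sorted tuples so that a link already present in the PPI data is not counted twice.

```python
from itertools import combinations


def complex_to_pairs(members):
    """Return all unordered protein pairs within one complex."""
    unique = sorted(set(members))
    return list(combinations(unique, 2))


def add_complexes(interactions, complexes):
    """Add the pairwise links of every complex to a set of binary interactions."""
    network = set(interactions)
    for members in complexes:
        network.update(complex_to_pairs(members))
    return network


# Hypothetical protein names, for illustration only.
ppi = {("A", "B")}
complexes = [["A", "B", "C", "D"], ["D", "E"]]
network = add_complexes(ppi, complexes)
```

Here the four-protein complex contributes 4 × 3 / 2 = 6 links, one of which was already in the PPI set, and the two-protein complex contributes one more, giving 7 links in total.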
Microarray gene expression data
Four sets of time-course data from DNA microarrays of S. cerevisiae are used in this study. These datasets have also been used to study genetic interactions in previous work [40]. The first set contains 17 time points during the mitotic cell cycle [41]. The second set contains 6 time points during heat shock, the third set contains 9 time points during sporulation [31], and the fourth set contains 32 time points during the cell cycle [42]. Altogether, 64 experimental conditions for all the genes in S. cerevisiae related to the cell cycle are used. For each missing value in an experiment, we substituted the missing gene expression ratio (relative to the reference state) with the average ratio of all the genes under that specific experimental condition.
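The missing-value rule described above amounts to column-mean imputation on a genes × conditions matrix. The sketch below assumes missing entries are encoded as `None` and that every condition has at least one observed value; the function name and the toy matrix are illustrative assumptions.

```python
def impute_condition_means(matrix):
    """Replace None entries with the mean of that condition over all genes."""
    n_conditions = len(matrix[0])
    means = []
    for k in range(n_conditions):
        observed = [row[k] for row in matrix if row[k] is not None]
        means.append(sum(observed) / len(observed))
    # Build a new matrix so that the input is left unchanged.
    return [[means[k] if value is None else value for k, value in enumerate(row)]
            for row in matrix]


# Rows are genes, columns are experimental conditions.
expr = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 5.0],
]
filled = impute_condition_means(expr)
```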
Construction of functional gene networks
Linkages of the FGN carry confidence scores to represent the functional coupling between two biological entities they represent. In this section, we calculated the confidence score of each linkage following the previous works [25, 26].
For the gene expression data, clustering analysis is carried out to identify functionally related groups of genes. We denote a gene expression data set as X = {x_{1}, x_{2},...,x_{M}}, where x_{i} = {x_{i1}, x_{i2},...,x_{iN}} is an N-dimensional vector representing gene i under N conditions. We use the clustering algorithm to group the M genes into S (S ≤ M - 1) different clusters C_{1}, C_{2},...,C_{S}.
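The similarity between two genes is measured by the Pearson correlation coefficient (PCC):
$$PCC(x_{i}, x_{j}) = \frac{\sum_{k=1}^{N} (x_{ik} - \bar{x}_{i})(x_{jk} - \bar{x}_{j})}{\sqrt{\sum_{k=1}^{N} (x_{ik} - \bar{x}_{i})^{2}} \sqrt{\sum_{k=1}^{N} (x_{jk} - \bar{x}_{j})^{2}}} \qquad (1)$$
where x_{ik} and x_{jk} are the expression values of the kth condition of the ith and jth genes, respectively, and \bar{x}_{i} and \bar{x}_{j} are the mean values of the ith and jth genes, respectively. PCC always lies in the range [-1, 1].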
At first, all genes in the gene expression profiles are considered as a single cluster, and the cluster is partitioned into two disjoint clusters. Partitioning is done in such a way that the genes x_{i} and x_{j} with the most negative PCC value are assigned to two different clusters. Genes having a larger PCC value with x_{i} than with x_{j} are assigned to the cluster that contains x_{i}; otherwise, they are placed in the cluster that contains x_{j}. In the next iteration, the cluster containing the gene pair (x_{i}, x_{j}) with the most negative PCC value is selected, and the above partitioning procedure is repeated until no negative PCC value is present between any pair of genes inside any cluster. This clustering method ensures that all pairs of genes in any cluster are positively correlated. It has been shown that this method is able to obtain clusters with higher biological significance than those obtained by some other algorithms, such as the Fuzzy K-means, GK and PAM clustering methods [43].
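The partitioning procedure above can be sketched as follows. This is an unoptimized illustration under stated assumptions, not the implementation from [43]: profiles are assumed to be non-constant (so the PCC is defined), correlations are recomputed rather than cached, and the function names are introduced here.

```python
import math
from itertools import combinations


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def most_negative_pair(cluster, profiles):
    """Return (pcc, i, j) for the most negatively correlated pair, or None."""
    best = None
    for i, j in combinations(cluster, 2):
        r = pcc(profiles[i], profiles[j])
        if r < 0 and (best is None or r < best[0]):
            best = (r, i, j)
    return best


def divisive_pcc_clustering(profiles):
    """Split clusters until no cluster contains a negatively correlated pair."""
    clusters = [list(range(len(profiles)))]
    while True:
        candidates = [(most_negative_pair(c, profiles), idx)
                      for idx, c in enumerate(clusters)]
        candidates = [(found, idx) for found, idx in candidates if found]
        if not candidates:
            return clusters
        # Pick the cluster holding the globally most negative pair.
        (_, i, j), idx = min(candidates, key=lambda t: t[0][0])
        left, right = [i], [j]
        for g in clusters[idx]:
            if g in (i, j):
                continue
            # A gene joins the seed gene it correlates with more strongly.
            if pcc(profiles[g], profiles[i]) > pcc(profiles[g], profiles[j]):
                left.append(g)
            else:
                right.append(g)
        clusters[idx:idx + 1] = [left, right]


# Genes 0 and 1 rise over the conditions; genes 2 and 3 fall.
profiles = [[1, 2, 3, 4], [2, 4, 6, 9], [4, 3, 2, 1], [9, 6, 4, 2]]
clusters = divisive_pcc_clustering(profiles)
```

Each split separates the two seed genes, so the number of clusters grows by one per iteration and the loop terminates after at most M - 1 splits.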
The weight of each linkage is given by equation (2), where x_{i} and x_{j} represent genes i and j with N conditions, respectively, and the two centroid terms denote the centroids of the clusters in which genes x_{i} and x_{j} are located, respectively. ||·||^{2} denotes the Euclidean distance. In equation (2), the constant L_{1} is a tradeoff parameter used to tune the ratio of the first and second terms in the weight function. According to [44], we choose L_{1} = 0.3 because we assume that the distance between the centroids of two clusters is more significant than the distance of each gene from its centroid. The outcome of the integration method is a weighted undirected graph, i.e. the functional gene network.
The properties of functional gene network for predicting SGI
Features for representing synthetic interaction
No. | Gene pair characteristic | Reference | Graph type
---|---|---|---
1 | Centrality degree | Barrat et al. (2004) | Weighted
2 | Clustering coefficient | Barrat et al. (2004) | Weighted
3 | Betweenness centrality | Brandes (2001) | Weighted
4 | Closeness centrality | Newman (2001) | Weighted
5 | Eigenvector centrality | Csardi G. (1965) | Weighted
6 | Stress centrality | Freeman LC. (1977) | Binary
7 | Information centrality | Stephenson K. (1989) | Binary
8 | Shortest path length | Newman (2001) | Weighted
9 | Flow betweenness centrality | Newman (2001) | Binary
10 | Mutual neighbor | Newman (2001) | Binary
(1) Centrality degree
where ω is the weight between two nodes, in which ω_{ ij }is greater than 0 if node v_{ i }is tied to node v_{ j }, and the value is the weight of the tie, which represents the strength of the relation between the two nodes.
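Assuming the weighted degree follows the node-strength definition of Barrat et al. (2004), i.e. the sum of the weights ω_{ij} of all ties attached to a node, it can be computed as in the sketch below. The dict-of-dicts network layout, the function names, and the toy weights are assumptions made for this example.

```python
def weighted_degree(weights, node):
    """Sum of the weights of all ties attached to `node` (node strength)."""
    return sum(weights.get(node, {}).values())


def binary_degree(weights, node):
    """Number of ties attached to `node`, ignoring their weights."""
    return len(weights.get(node, {}))


# Undirected toy network stored as a symmetric dict of dicts.
fgn = {
    "A": {"B": 0.8, "C": 0.2},
    "B": {"A": 0.8},
    "C": {"A": 0.2},
}
strength_a = weighted_degree(fgn, "A")
degree_a = binary_degree(fgn, "A")
```

Nodes B and C have the same binary degree but different weighted degrees, which is the extra information the weighted network carries.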
(2) Clustering coefficient
(3) Weighted Shortest Path
Both the closeness centrality and betweenness centrality rely on the calculation of shortest path in a network. Therefore, a first step towards extending these measures to weighted networks is to generalize how shortest path is defined in weighted networks.
In weighted networks, the shortest path is a path between two nodes with the minimal sum of the weights of its constituent edges. Since all edges have the same weight in unweighted networks, the shortest path between two nodes is through the smallest number of intermediary nodes. However, a complication arises when the ties in a network do not have the same weight attached to them. There have been several attempts to calculate shortest distances in weighted networks in previous work [46, 47]. In our work, we applied Dijkstra's algorithm to the weighted biological network by inverting the positive weights in the network [47]. Thus, high values represent weak ties, whereas low values represent strong ties.
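A minimal sketch of this step is given below, assuming that "inverting" a weight means taking its reciprocal 1/ω, so that a strong tie becomes a short edge. The function names and the toy network are illustrative assumptions, and the priority queue comes from the Python standard library rather than from the software used in this study.

```python
import heapq


def invert_weights(weights):
    """Turn tie strengths into costs: strong ties become short edges."""
    return {u: {v: 1.0 / w for v, w in nbrs.items()} for u, nbrs in weights.items()}


def dijkstra(costs, source):
    """Shortest-path cost from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, c in costs.get(u, {}).items():
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# The direct tie A-C is weak (0.1); the detour through B uses two strong ties.
fgn = {
    "A": {"B": 0.5, "C": 0.1},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"A": 0.1, "B": 0.5},
}
dist = dijkstra(invert_weights(fgn), "A")
```

After inversion the direct edge A-C costs 10 while the path A-B-C costs 2 + 2 = 4, so the shortest path runs through B even though it uses more intermediary nodes.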
(4) Betweenness centrality
where g_{ jk }is the number of shortest geodesic paths from node v_{ j }to v_{ k }. g_{ jk }(i) is the number of shortest geodesic paths from v_{ j }to v_{ k }which pass through the node v_{ i }.
In the case of weighted network, we assume that the flow in the network occurs over the paths that Dijkstra's algorithm identifies and use this algorithm to find the nodes that funnel the flow in the network. Then the weighted betweenness centrality is extended by counting the number of paths found by Dijkstra's algorithm on a weighted network instead of the number found on a binary network [48].
(5) Closeness centrality
where n is the total number of nodes in the network.
(6) Eigenvector centrality
Eigenvector centrality is a measure of the importance of a node in a network. It assigns relative scores to all nodes in the network based on the principle that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes.
Hence we see that x is an eigenvector of the adjacency matrix with eigenvalue λ. In our work, we used the free software package named igraph to calculate the eigenvector centrality of weighted network [50].
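Power iteration is one standard way to obtain the leading eigenvector of a weighted adjacency matrix. The sketch below is an illustration only: the calculations in this study were done with igraph, and the function name, tolerance, iteration cap, and toy network are assumptions. Plain power iteration may fail to converge on bipartite graphs, so the toy network contains a triangle.

```python
import math


def eigenvector_centrality(weights, max_iter=1000, tol=1e-12):
    """Leading eigenvector of the weighted adjacency matrix by power iteration."""
    nodes = sorted(weights)
    x = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(max_iter):
        # Each node's new score is the weighted sum of its neighbours' scores.
        new = {v: sum(w * x[u] for u, w in weights[v].items()) for v in nodes}
        norm = math.sqrt(sum(s * s for s in new.values()))
        new = {v: s / norm for v, s in new.items()}
        if max(abs(new[v] - x[v]) for v in nodes) < tol:
            return new
        x = new
    return x


# Triangle A-B-C with a pendant node D attached to A; all weights equal to 1.
net = {
    "A": {"B": 1.0, "C": 1.0, "D": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0},
    "D": {"A": 1.0},
}
cent = eigenvector_centrality(net)
```

Node A scores highest, B and C tie by symmetry, and D scores lowest because its only connection, although to a high-scoring node, is a single tie.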
In addition to the above six weighted network properties, we also calculated several binary network properties: stress centrality [51], information centrality [52], flow betweenness centrality [53], and the number of mutual neighbors between proteins v_{i} and v_{j}. All ten network properties reflect either the local network structure around a node or the global network topology.
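Most of these properties are defined for single nodes, whereas the classifier works on gene pairs. The sketch below counts mutual neighbors directly and, following the "average" and "absolute difference" rows of the statistics table in the Discussion, combines a node-level score into two pair-level features. The function names and the toy network are assumptions for illustration.

```python
def mutual_neighbors(adjacency, u, v):
    """Number of nodes tied to both u and v (u and v themselves excluded)."""
    return len((set(adjacency[u]) & set(adjacency[v])) - {u, v})


def pair_features(node_scores, u, v):
    """Combine one node-level property into two pair-level features."""
    a, b = node_scores[u], node_scores[v]
    return {"average": (a + b) / 2.0, "abs_difference": abs(a - b)}


# Binary toy network: triangle A-B-C with a pendant node D attached to A.
adjacency = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
shared = mutual_neighbors(adjacency, "B", "D")
features = pair_features(degree, "A", "D")
```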
Graph-based semi-supervised classifier
SSL is halfway between supervised and unsupervised learning; it is a very active field that has recently attracted a considerable amount of research [7, 54]. In essence, three different kinds of SSL algorithms are in use: generative models, low-density separation algorithms, and graph-based methods. In our study, we use a graph-based SSL method because of its solid mathematical background, its relationship with kernel methods, its support for visualization, and its good results in many areas, such as computational biology [32], web page classification [54], and hyperspectral image classification [7]. We present here the whole formulation of the graph-based SSL algorithm.
Consider the whole dataset represented by χ = (χ_{l}, χ_{n}), consisting of labelled inputs χ_{l} = {x_{1}, x_{2},...,x_{l}} and unlabelled inputs χ_{n} = {x_{l+1}, x_{l+2},...,x_{n}}, along with the corresponding labels {y_{1}, y_{2},...,y_{l}} of the small labelled portion. Consider a connected weighted graph G = (V, E) whose vertex set V corresponds to the above n data points, with nodes L = {1, 2,...,l} corresponding to the labelled points with labels y_{1}, y_{2},...,y_{l} and nodes U = {l + 1, l + 2,...,n} corresponding to the unlabelled points. For SSL, the objective is to infer the labels {y_{l+1}, y_{l+2},...,y_{n}} of the unlabelled data {x_{l+1}, x_{l+2},...,x_{n}}, where typically l ≪ n.
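First, we form the affinity matrix W, defined as in [8] by W_{ij} = exp(-||x_{i} - x_{j}||^{2} / 2σ^{2}) for i ≠ j and W_{ii} = 0, where x_{i} and x_{j} denote different points in the graph G. The constant σ is a length scale hyperparameter. Therefore, nearby points in Euclidean space are assigned large edge weights, and vice versa.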
Then let \mathcal{F} denote the set of n × l matrices with non-negative elements. A matrix F ∈ \mathcal{F} corresponds to one classification on χ = (χ_{l}, χ_{n}) by assigning each point x_{i} a label y_{i} = argmax_{j ≤ l} F_{ij}. We define an n × l matrix Y ∈ \mathcal{F} with Y_{ij} = 1 if x_{i} is labelled as y_{i} = j and Y_{ij} = 0 otherwise.
Secondly, we build the matrix S = D^{-1/2}WD^{-1/2}, where D is a diagonal matrix whose (i, i)-element equals the sum of the ith row of W. We then iterate F(t + 1) = αSF(t) + (1 - α)Y until F converges, where α is a predefined constant between 0 and 1.
The classification matrix can then be calculated as F* = (I - αS)^{-1}Y, where I is the identity matrix. As in [8], F* can thus be obtained without iteration. After the above steps, labels are assigned to the unlabelled data {x_{l+1}, x_{l+2},...,x_{n}}.
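The whole procedure can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the released implementation: the affinity matrix is assumed to be the Gaussian form with zero diagonal used in [8], the columns of Y are assumed to index the two classes, and α = 0.9, σ = 0.5, the fixed 500 iterations, and the one-dimensional toy points are arbitrary choices made here.

```python
import math


def affinity_matrix(points, sigma):
    """Gaussian edge weights with zero diagonal (assumed form, as in [8])."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-sq / (2.0 * sigma ** 2))
    return w


def normalize(w):
    """S = D^{-1/2} W D^{-1/2}, with D the diagonal matrix of row sums of W."""
    d = [sum(row) for row in w]
    n = len(w)
    return [[w[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]


def propagate(s, y, alpha=0.9, iterations=500):
    """Iterate F(t+1) = alpha * S * F(t) + (1 - alpha) * Y, starting from F(0) = Y."""
    n, c = len(y), len(y[0])
    f = [row[:] for row in y]
    for _ in range(iterations):
        f = [[alpha * sum(s[i][k] * f[k][j] for k in range(n)) + (1 - alpha) * y[i][j]
              for j in range(c)] for i in range(n)]
    return f


# Two well-separated groups on a line; only the first and last points are labelled.
points = [(0.0,), (0.2,), (0.4,), (3.0,), (3.2,), (3.4,)]
y = [[1, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1]]  # columns: class 0, class 1
f = propagate(normalize(affinity_matrix(points, sigma=0.5)), y)
labels = [max(range(2), key=lambda j: row[j]) for row in f]
```

Because edges within a group are far heavier than edges between groups, the label of each labelled point spreads to its own group, and the unlabelled points inherit the class of their nearest labelled neighbour.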
Support vector machines classifier
The SVM algorithm was proposed by Vapnik as an effective and increasingly popular learning approach for solving two-class pattern recognition problems [55]. As a typical supervised machine learning method, SVM is attractive because it is not only well founded theoretically but also superior in practical applications. Intuitively, the SVM classifier is based on the structural risk minimization principle, for which error bound analysis has been theoretically motivated. The method is defined over a vector space, where the problem is to find a decision surface that "best" separates the data points of two classes by maximizing the margin. SVM has been widely applied in a number of pattern recognition areas, such as text categorization [56] and object recognition [57]. In most of these cases, the performance of SVM is significantly better than that of other supervised machine learning methods, including neural network and decision tree classifiers [17]. SVM has a number of advantageous properties, including the ability to handle large feature spaces, effective avoidance of overfitting, and information condensing for the given data set. A brief introduction to SVM is given in Additional file 1.
Here, we describe the use of LibSVM, provided by Chih-Chung Chang. LibSVM is an integrated software package for support vector classification [58], which makes it easy to construct an SVM classifier: we only need to choose a kernel function and a regularization parameter to train the SVM. In this study, we adopt the radial basis function (RBF) as the kernel function, whose parameters were optimized by n-fold cross-validation on the training set [55]. Specifically, a grid search was used to find the optimal parameters, such as C and gamma; it tries values of each parameter across the specified search range using geometric steps. Although the grid search method is computationally expensive, it is feasible in our case.
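The structure of such a grid search with geometric steps is sketched below. No LibSVM call is made: `toy_score` is a stand-in for the mean cross-validation accuracy of an RBF SVM trained with a given (C, gamma), and the search ranges are hypothetical, since the ranges actually used are not stated here.

```python
import math
from itertools import product


def geometric_range(start_exp, stop_exp, step, base=2.0):
    """Values base**e for e = start_exp, start_exp + step, ..., up to stop_exp."""
    values = []
    e = start_exp
    while e <= stop_exp:
        values.append(base ** e)
        e += step
    return values


def grid_search(evaluate, c_values, gamma_values):
    """Return the (score, C, gamma) triple with the best score on the grid."""
    return max((evaluate(c, g), c, g) for c, g in product(c_values, gamma_values))


# Hypothetical search ranges, chosen only for illustration.
c_values = geometric_range(-5, 15, 2)
gamma_values = geometric_range(-15, 3, 2)


def toy_score(c, gamma):
    """Stand-in for cross-validation accuracy; peaks at C = 2**3, gamma = 2**-3."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 3)


best_score, best_c, best_gamma = grid_search(toy_score, c_values, gamma_values)
```

The cost grows with the product of the two grid sizes (here 11 × 10 = 110 evaluations, each of which would be one full n-fold cross-validation), which is why the search is expensive but still tractable for two parameters.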
Declarations
Acknowledgements
Dr. Zhou is partially supported by NIH R01LM010185-01 and NIH R01CA121225-01A2. This work was supported by the National Science Foundation of China (NSFC) under Grant No. 30900321, 30700161 and 60973153, the China Postdoctoral Science Foundation (Grant No. 20090450825), and the Knowledge Innovation Program of the Chinese Academy of Sciences (Grant No. 0823A16121). We thank the anonymous reviewers for their helpful comments and suggestions.
References
- Hartman JLt, Garvik B, Hartwell L: Principles for the buffering of genetic variation. Science 2001, 291(5506):1001–1004. 10.1126/science.291.5506.1001
- Kelley R, Ideker T: Systematic interpretation of genetic interactions using protein networks. Nature Biotechnology 2005, 23(5):561–566. 10.1038/nbt1096
- Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al.: Global mapping of the yeast genetic interaction network. Science 2004, 303(5659):808–813. 10.1126/science.1091317
- Wong SL, Zhang LV, Tong AH, Li Z, Goldberg DS, King OD, Lesage G, Vidal M, Andrews B, Bussey H, et al.: Combining biological networks to predict genetic interactions. Proc Natl Acad Sci USA 2004, 101(44):15682–15687. 10.1073/pnas.0406614101
- Zhong W, Sternberg PW: Genome-wide prediction of C. elegans genetic interactions. Science 2006, 311(5766):1481–1484. 10.1126/science.1123287
- Onami S, Kitano H: Genome-wide prediction of genetic interactions in a metazoan. Bioessays 2006, 28(11):1087–1090. 10.1002/bies.20490
- Camps-Valls G, Marsheva TVB, Zhou DY: Semi-supervised graph-based hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 2007, 45(10):3044–3054. 10.1109/TGRS.2007.895416
- Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B: Learning with local and global consistency. Advances in Neural Information Processing Systems 16 2004, 321–328.
- Lee I, Date SV, Adai AT, Marcotte EM: A probabilistic functional network of yeast genes. Science 2004, 306(5701):1555–1558. 10.1126/science.1099511
- You ZH, Zhang SW, Li LP: Integration of Genomic and Proteomic Data to Predict Synthetic Genetic Interactions Using Semi-supervised Learning. Emerging Intelligent Computing Technology and Applications: With Aspects of Artificial Intelligence 2009, 5755:635–644.
- Jansen R, Yu HY, Greenbaum D, Kluger Y, Krogan NJ, Chung SB, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302(5644):449–453. 10.1126/science.1087361
- Yamanishi Y, Vert JP, Kanehisa M: Protein network inference from multiple genomic data: a supervised approach. Bioinformatics 2004, 20(Suppl 1):i363–370. 10.1093/bioinformatics/bth910
- To CC, Vohradsky J: Supervised inference of gene-regulatory networks. BMC Bioinformatics 2008, 9. 10.1186/1471-2105-9-2
- Zhao XM, Wang Y, Chen LN, Aihara K: Protein domain annotation with integration of heterogeneous information sources. Proteins-Structure Function and Bioinformatics 2008, 72(1):461–473. 10.1002/prot.21943
- Zheng H, Wang H, Glass DH: Integration of genomic data for inferring protein complexes from global protein-protein interaction networks. IEEE Trans Syst Man Cybern B Cybern 2008, 38(1):5–16. 10.1109/TSMCB.2007.908912
- Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D: A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 2003, 100(14):8348–8353. 10.1073/pnas.0832373100
- Linghu B, Snitkin ES, Holloway DT, Gustafson AM, Xia Y, DeLisi C: High-precision high-coverage functional inference from integrated data sources. BMC Bioinformatics 2008, 9. 10.1186/1471-2105-9-119
- Lee I, Li Z, Marcotte EM: An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae. PLoS ONE 2007, 2(10):e988. 10.1371/journal.pone.0000988
- Zhao XM, Wang Y, Chen L, Aihara K: Protein domain annotation with integration of heterogeneous information sources. Proteins 2008, 72(1):461–473. 10.1002/prot.21943
- Paladugu SR, Zhao S, Ray A, Raval A: Mining protein networks for synthetic genetic interactions. BMC Bioinformatics 2008, 9. 10.1186/1471-2105-9-426
- Lezon TR, Banavar JR, Cieplak M, Maritan A, Fedoroff NV: Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. Proc Natl Acad Sci USA 2006, 103(50):19033–19038. 10.1073/pnas.0609152103
- Scott BT, Bovill EG, Callas PW, Hasstedt SJ, Leppert MF, Valliere JE, Varvil TS, Long GL: Genetic screening of candidate genes for a prothrombotic interaction with type I protein C deficiency in a large kindred. Thromb Haemost 2001, 85(1):82–87.
- Damjanovic A, Garcia-Moreno B, Lattman EE, Garcia AE: Molecular dynamics study of hydration of the protein interior. Computer Physics Communications 2005, 169(1–3):126–129. 10.1016/j.cpc.2005.03.030
- Whitten ST, Garcia-Moreno B, Hilser VJ: Local conformational fluctuations can modulate the coupling between proton binding and global structural transitions in proteins. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(12):4282–4287. 10.1073/pnas.0407499102
- Tu K, Yu H, Li YX: Combining gene expression profiles and protein-protein interaction data to infer gene functions. J Biotechnol 2006, 124(3):475–485. 10.1016/j.jbiotec.2006.01.024
- Segal E, Wang H, Koller D: Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics 2003, 19(Suppl 1):i264–271. 10.1093/bioinformatics/btg1037
- Tornow S, Mewes HW: Functional modules by relating protein interaction networks and gene expression. Nucleic Acids Res 2003, 31(21):6283–6289. 10.1093/nar/gkg838
- Xiao G, Pan W: Gene function prediction by a combined analysis of gene expression data and protein-protein interaction data. J Bioinform Comput Biol 2005, 3(6):1371–1389. 10.1142/S0219720005001612
- Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12(1):37–46. 10.1101/gr.205602
- Greenbaum D, Jansen R, Gerstein M: Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 2002, 18(4):585–596. 10.1093/bioinformatics/18.4.585
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863
- Aittokallio T, Schwikowski B: Graph-based methods for analysing networks in cell biology. Brief Bioinform 2006, 7(3):243–255. 10.1093/bib/bbl022
- Kafri R, Dahan O, Levy J, Pilpel Y: Preferential protection of protein interaction network hubs in yeast: Evolved functionality of genetic redundancy. Proceedings of the National Academy of Sciences of the United States of America 2008, 105(4):1243–1248. 10.1073/pnas.0711043105
- Lubovac Z, Gamalielsson J, Olsson B: Combining functional and topological properties to identify core modules in Protein Interaction Networks. Proteins-Structure Function and Bioinformatics 2006, 64(4):948–959. 10.1002/prot.21071
- Schormann N, Senkovich O, Walker K, Wright DL, Anderson AC, Rosowsky A, Ananthan S, Shinkre B, Velu S, Chattopadhyay D: Structure-based approach to pharmacophore identification, in silico screening, and three-dimensional quantitative structure-activity relationship studies for inhibitors of Trypanosoma cruzi dihydrofolate reductase function. Proteins-Structure Function and Bioinformatics 2008, 73(4):889–901. 10.1002/prot.22115
- Caruana R, Niculescu-Mizil A: An Empirical Comparison of Supervised Learning Algorithms. Proceedings of the 23rd international conference on Machine learning 2006, 148:161–168.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34(Database):D535–539. 10.1093/nar/gkj109
- Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, et al.: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002, 415(6868):180–183. 10.1038/415180a
- Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, et al.: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002, 415(6868):141–147. 10.1038/415141a
- Hakamada K, Hanai T, Honda H, Kobayashi T: Preprocessing method for inferring genetic interaction from gene expression data using Boolean algorithm. J Biosci Bioeng 2004, 98(6):457–463.
- Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, et al.: A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 1998, 2(1):65–73. 10.1016/S1097-2765(00)80114-8
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9(12):3273–3297.
- Bhattacharya A, De RK: Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes: detecting varying patterns in expression profiles. Bioinformatics 2008, 24(11):1359–1366. 10.1093/bioinformatics/btn133
- Maraziotis IA, Dimitrakopoulou K, Bezerianos A: Growing functional modules from a seed protein via integration of protein interaction and gene expression data. BMC Bioinformatics 2007, 8. 10.1186/1471-2105-8-408
- Barrat A, Barthelemy M, Pastor-Satorras R, Vespignani A: The architecture of complex weighted networks. Proceedings of the National Academy of Sciences of the United States of America 2004, 101(11):3747–3752. 10.1073/pnas.0400087101
- Katz L: A New Status Index Derived from Sociometric Analysis. Psychometrika 1953, 18(1):39–43. 10.1007/BF02289026
- Dijkstra EW: A note on two problems in connexion with graphs. Numerische Mathematik 1959, 1:269–271. 10.1007/BF01386390
- Opsahl T, Panzarasa P: Clustering in weighted networks. Social Networks 2009, 31(2):155–163. 10.1016/j.socnet.2009.02.002
- Newman MEJ: Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Physical Review E 2001, 6401(1).
- Csardi G, Nepusz T: The igraph software package for complex network research. InterJournal 2006, Complex Systems:1695.
- Freeman LC: Set of Measures of Centrality Based on Betweenness. Sociometry 1977, 40(1):35–41. 10.2307/3033543
- Stephenson K, Zelen M: Rethinking Centrality: Methods and Applications. Social Networks 1989, 11:1–37. 10.1016/0378-8733(89)90016-6
- Brandes U, Fleischer D: Centrality measures based on current flow. STACS 2005, Proceedings 2005, 3404:533–544.
- Liu R, Zhou JZ, Liu M: A graph-based semi-supervised learning algorithm for web page classification. ISDA 2006: Sixth International Conference on Intelligent Systems Design and Applications 2006, 2:856–860.
- Cortes C, Vapnik V: Support-Vector Networks. Mach Learn 1995, 20(3):273–297.
- Drucker H, Wu DH, Vapnik VN: Support vector machines for spam categorization. IEEE T Neural Networ 1999, 10(5):1048–1054. 10.1109/72.788645
- Pontil M, Verri A: Support Vector Machines for 3D object recognition. IEEE T Pattern Anal 1998, 20(6):637–646. 10.1109/34.683777
- Chang C-C, Lin C-J: LIBSVM: a library for support vector machines. 2001. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.