Assessing the functional coherence of modules found in multiple-evidence networks from Arabidopsis
© Lysenko et al; licensee BioMed Central Ltd. 2011
Received: 20 August 2010
Accepted: 25 May 2011
Published: 25 May 2011
Combining multiple evidence-types from different information sources has the potential to reveal new relationships in biological systems. The integrated information can be represented as a relationship network, and clustering the network can suggest possible functional modules. The value of such modules for gaining insight into the underlying biological processes depends on their functional coherence. The challenges that we wish to address are to define and quantify the functional coherence of modules in relationship networks, so that they can be used to infer function of as yet unannotated proteins, to discover previously unknown roles of proteins in diseases as well as for better understanding of the regulation and interrelationship between different elements of complex biological systems.
We have defined the functional coherence of modules with respect to the Gene Ontology (GO) by considering two complementary aspects: (i) the fragmentation of the GO functional categories into the different modules and (ii) the most representative functions of the modules. We have proposed a set of metrics to evaluate these two aspects and demonstrated their utility in Arabidopsis thaliana. We selected 2355 proteins for which experimentally established protein-protein interaction (PPI) data were available. From these we have constructed five relationship networks: four based on single types of data (PPI, co-expression, co-occurrence of protein names in scientific literature abstracts, and sequence similarity) and a fifth combining these four evidence types. The ability of these networks to suggest biologically meaningful grouping of proteins was explored by applying Markov clustering and then by measuring the functional coherence of the clusters.
Relationship networks integrating multiple evidence-types are biologically informative and allow more proteins to be assigned to a putative functional module. Using additional evidence types concentrates the functional annotations in a smaller number of modules without unduly compromising their consistency. These results indicate that integration of more data sources improves the ability to uncover functional association between proteins, both by allowing more proteins to be linked and producing a network where modular structure more closely reflects the hierarchy in the gene ontology.
The ever-increasing availability of high-volume proteomic, genomic and transcriptomics datasets has led to multiple studies aimed at the systems-level interpretation of this information using biological networks and relationship networks. Biological networks are graphs where the nodes are molecules and edges indicate interactions between them [1, 2]. As explained in , in this type of network an allowance can be made for "suppression of detail", e.g. the intermediate components of some interactions may be omitted and instead represented by an edge. Most commonly this type of abstraction is used to represent gene regulation, where the DNA-protein interaction, transcription and translation are represented by just one edge between the regulator and its target protein. Relationship networks  are a superset of biological networks, where there is no longer a restriction that an edge must represent an actual real-life process that links the two molecules, but instead may indicate a shared property, such as two proteins having the same type of protein domain or being mentioned in the same publication.
The types of data used for construction of such networks include, but are not limited to, sequence similarity , shared sequence features [5, 6], genetic interactions [6–10], gene co-expression [5, 6, 11–14], protein-protein interaction [5–7, 11, 15–18], domain interaction [19, 20] and term co-occurrence in the scientific literature [3, 5, 10, 11, 21]. These types of information can be analysed independently or integrated together in order to encompass a wider range of biological mechanisms, provide additional evidence of association between entities in the network and connect disjoint parts of the network. In these studies, different techniques have been developed for the analysis of relationship networks, but they follow similar approaches: partitioning the network into modules, identifying the graph-theoretic properties of the network and relating these to biological function. In this work we have adopted a similar approach and have devised a set of metrics for quantifying the functional coherence of the modules in order to explore the effect of using multiple evidence-types in an integrated relationship network of Arabidopsis thaliana proteins.
Clustering approaches work by identifying densely interconnected areas within a network  and are commonly used to detect modular structure in graphs. In the context of biologically-relevant networks, these groups are often referred to as functional modules [2, 7]. Functional modules in biological networks are groups of molecules that are more linked to the other members of the group than to non-members and have similar function . The modular structure can be used to infer function of as yet unannotated proteins , to discover previously unknown roles of proteins in diseases  as well as for better understanding of the regulation and interrelationship between different elements of complex biological systems . The function of a module is commonly identified from the annotation of its members with respect to the Gene Ontology (GO) .
GO consists of three separate categories - Biological Process, Molecular Function and Cellular Component, where each category consists of a controlled vocabulary of terms structured as a directed acyclic graph with qualified edges describing the semantic relationship between these terms. Each protein can be annotated with multiple GO terms and inherits the annotation of the parent terms and this makes it challenging to quantify and analyse the functional similarity between GO annotations. This has stimulated a number of studies that have explored these problems in detail. In particular, the importance of quantitative characterisation of GO term specificity (information content, IC) was demonstrated by  and based on this metric, several pair-wise quantitative measurements were developed that take into account the structure and properties of the Gene Ontology (reviewed in ). In a related set of efforts, a number of metrics were also designed to measure semantic consistency of protein sets. Original approaches were designed to identify the functional annotations for which a group of proteins was significantly enriched and did not take into account hierarchical structure of GO . Although useful, these methods have a number of limitations, which were discussed in detail in the following studies [26, 27]. To address these shortcomings, a number of extensions were proposed that combine some aspects of enrichment-based methods with adjustments for the relationship between the terms [28–30]. At the same time, another set of measures was also developed for the quantification of overall relatedness of annotations, rather than their overrepresentation within the set [27, 31–34]. In this study we have drawn upon the insights that have emerged from this work in order to define a descriptive measure for comparison of functional annotation of protein sets.
We claim that to determine the biological relevance of the partitioning of a set of proteins there are two important aspects that need to be taken into consideration. The first is that the set of GO terms that best describes the common function of a representative proportion of proteins in the modules can be found at any annotation specificity level. However, at the higher levels, which are close to the root of the Gene Ontology, the annotation will not be particularly informative. This leads to a trade-off between the specificity of annotation terms and the number of proteins in the module to which it applies. The needs of the particular application case may dictate which of these two components is more important, and metrics have been developed that allow the emphasis to be placed on one or the other . Using the metric defined in this paper (AIC-MICA), we were able to explore these two properties in five different relationship networks. The second aspect to be considered is that the sets of proteins with the same GO annotation can be fragmented, i.e. assigned to a number of different clusters by the clustering algorithm. Not only can a functionally similar group be spread across a number of clusters, but it may also be more or less concentrated in the clusters where it is present.
To assess the functional coherence of modules from a relationship network both of these aspects, namely the representative functions of modules and the fragmentation of functional categories, are also relevant. Here we explore the potential of combined relationship networks to recover functional modules by considering four sources of information: protein-protein interaction (PPI), co-expression (COE), sequence similarity (SEQ) and co-occurrence of terms in the scientific literature (LIT). We have also constructed a combined network (ALL), which is a union of these four networks. These evidence types were chosen because they are often used for inferring functional relationships among genes and proteins and are readily available from the application of high throughput 'omics techniques. A large amount of co-expression data are available for Arabidopsis thaliana (see for example, ). Measurements of sequence similarity can be obtained for all pairs of proteins  and co-occurrence of protein terms in abstracts can be extracted from the scientific literature . We decided to restrict the set of proteins in the network to those for which protein-protein interaction information is available, as we currently consider this to be limiting for Arabidopsis. This restriction means that we are only considering a small subset of Arabidopsis proteins, but has the advantage that it leads to a more balanced distribution of evidence types from the four information sources among the relationships between proteins. This setting allows us to evaluate to what extent patterns and trends that were previously found in whole proteome-based networks still hold in situations where only a subset of the whole proteome is analysed. Another motivation is to evaluate the usefulness of these approaches to extract the best possible information under conditions when data are scarce or incomplete.
As we have demonstrated in our previous work , by combining data from multiple resources it is possible to assemble much larger and more comprehensive integrated datasets. Several providers also offer pre-integrated datasets for Arabidopsis, such as STRING  and AtPIN . However, the integration protocols of these data sources are not always clearly described even though the protocol can affect the structure of the network .
We have used the Ondex data-integration and visualization platform  to integrate and analyse the information sources and we have verified that the dataset used for this study is sufficiently representative by performing the same type of analysis on the data held in the STRING database for the same set of proteins.
Number of edges in the graph with evidence from the four information sources after applying a threshold on the relevant strength of the relationships (as defined in the Methods section)
Table: A comparison of graph theoretic properties for the different evidence types. Columns: evidence network type; number of connected components; size of the largest connected component; diameter of the largest connected component.
Since the initial dataset consisted of those proteins for which interaction data were available, we would expect no unconnected proteins in the PPI and ALL networks. The numbers of orphan (i.e. unconnected) proteins in the SEQ, COE and LIT networks were 855, 1304 and 1343, respectively. These numbers, however, depend on the score thresholds chosen (refer to Methods for the values used in this study).
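The graph-theoretic quantities discussed above (connected components, size and diameter of the largest component, orphan proteins) can be illustrated with a minimal Python sketch. It is not the analysis code used in the study, which relied on NetworkX from within Ondex; the protein identifiers and edges are invented, and counting orphans as singleton components is an assumption made here for illustration.

```python
from collections import deque

def build_adjacency(nodes, edges):
    """Undirected adjacency map; self-loops are ignored."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Return a list of node sets, one per connected component."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        components.append(component)
    return components

def eccentricity(adjacency, source):
    """Longest shortest-path distance from source (breadth-first search)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())

def diameter(adjacency, component):
    """Diameter of one component: the largest eccentricity of its nodes."""
    return max(eccentricity(adjacency, n) for n in component)

# Hypothetical toy network: a chain of four proteins, a pair, and one orphan.
nodes = ["P1", "P2", "P3", "P4", "P5", "P6", "P7"]
edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]

adjacency = build_adjacency(nodes, edges)
components = connected_components(adjacency)   # orphans count as singletons here
largest = max(components, key=len)
orphans = sorted(n for n, nbrs in adjacency.items() if not nbrs)
largest_diameter = diameter(adjacency, largest)
```

For networks of the size considered here (a few thousand nodes), an all-sources breadth-first search of this kind is adequate; larger graphs would call for a dedicated graph library.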
To explore the functional groupings of proteins in the network, we combined Arabidopsis GO annotations from three sources: IntAct , GOA-EBI  and UNIPROT . Information content-based measures (see Methods section) were used to evaluate annotation specificity.
We wished to explore (i) whether the clusters contain proteins that are generally similar in terms of their functions, as assigned by Gene Ontology terms (the most representative GO terms in a cluster), and (ii) the way in which proteins with the same functional roles are distributed across different clusters (the fragmentation of GO terms).
Coverage and specificity of the most representative function of modules
The utility of clustering depends on being able to group together a large enough number of proteins, so as to facilitate exploring the modular structure of the network without diluting the information content of the clusters to such an extent that the groupings do not capture biologically meaningful relationships.
The Average Information Content of the sets of Most Informative Common Ancestor GO terms (AIC-MICA) was used to determine the coverage and the specificity of the most representative function of modules (AIC-MICA is defined in the Methods). If a cluster contained proteins that were of very diverse function, we would expect that the GO categories corresponding to the most representative functions would not be very specific, i.e. the Most Informative Common Ancestor (MICA, see ) would be close to the root of the Ontology tree and thus would not represent a functionally meaningful grouping. Given that the links in a relationship network may not always reflect accurate functional relationships, we do not look for the MICA of all the proteins in the cluster. Instead we measure the Average Information Content (AIC) associated with a set of MICA of at least a certain coverage (percentage of all proteins in a cluster), sampled at 10% increments from 40% to 90%. This method allows simultaneous detection of functional similarities in more than one functional category and is more robust to outliers, as only a certain proportion of the proteins in the cluster need to share functional similarity in order for their ancestor GO term to be included in the set.
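The coverage-based idea described above can be sketched in Python as follows. This is one plausible reading of the description, not the authors' formal definition: a GO term is treated as shared if it annotates (directly or through inheritance) at least the given fraction of proteins in a cluster, shared terms that are ancestors of other shared terms are dropped, and the information content of the remaining terms is averaged. The function name, the toy ontology, the annotations and the IC values are all invented for illustration.

```python
def aic_mica(cluster_annotations, ancestors, ic, coverage):
    """Average IC of the most informative shared terms at a coverage level.

    cluster_annotations: {protein: set of GO terms}
    ancestors:           {term: set of all its ancestor terms}
    ic:                  {term: information content}
    coverage:            fraction of cluster proteins that must share a term
    """
    n = len(cluster_annotations)
    counts = {}
    for terms in cluster_annotations.values():
        closed = set(terms)
        for t in terms:                      # inherit annotation of parent terms
            closed |= ancestors.get(t, set())
        for t in closed:
            counts[t] = counts.get(t, 0) + 1
    shared = {t for t, c in counts.items() if c / n >= coverage}
    redundant = set()
    for t in shared:                         # ancestors of a shared term add nothing
        redundant |= ancestors.get(t, set()) & shared
    mica = shared - redundant
    if not mica:
        return 0.0, mica
    return sum(ic[t] for t in mica) / len(mica), mica

# Hypothetical toy ontology: BP -> stimulus -> hormone -> auxin, and BP -> metabolic.
ancestors = {
    "BP": set(),
    "stimulus": {"BP"},
    "hormone": {"stimulus", "BP"},
    "auxin": {"hormone", "stimulus", "BP"},
    "metabolic": {"BP"},
}
ic = {"BP": 0.0, "stimulus": 1.0, "hormone": 2.0, "auxin": 3.0, "metabolic": 2.0}
cluster = {
    "p1": {"auxin"},
    "p2": {"auxin"},
    "p3": {"hormone"},
    "p4": {"hormone", "metabolic"},
    "p5": {"metabolic"},
}

low_cov_value, low_cov_terms = aic_mica(cluster, ancestors, ic, 0.4)
high_cov_value, high_cov_terms = aic_mica(cluster, ancestors, ic, 0.8)
```

In this toy cluster the 40% level retains two specific terms from different branches, whereas the 80% level retains only a single, more general term, which is the coverage versus specificity trade-off the metric is intended to expose.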
The STRING comparison analysis was undertaken using a complete set of information from the STRING database for the same set of 2355 proteins. The results indicate that the performance of STRING at the higher coverage (80-90%) levels was comparable to that of the ALL network. We have also considered individual evidence types from STRING (coexpression, literature and experimental PPI detection), which were found to be similar to the results obtained for the corresponding datasets constructed for this paper, if the data are interpreted as an un-weighted network. The results of this analysis can be found in the Additional File 1: additional figures and analysis.
Modules in the ALL relationship network and their most representative functions
We have explored the possibility of further post-processing the clusters produced by the MCL algorithm by looking at the average FSWeight [47, 48] of the pairs of clusters, but the results proved inconclusive. Further information about this analysis is included in Additional File 1: additional figures and analysis. However, we did observe that the average FSWeight of edges inside clusters was significantly higher than that for edges connecting different clusters.
Fragmentation of functional categories
The other aspect that needs to be taken into consideration when assessing the functional coherence of modules is the fragmentation of functional categories. Here, we examined how the Gene Ontology terms were distributed across the clusters.
Table 3. The first two rows show the average entropy for the networks and, for comparison, the average entropy for the networks with GO labels randomly permuted. Rows: average entropy (actual network); average entropy (randomly permuted network); relative decrease in entropy (compared to randomly permuted network).
A lower entropy value implies more ordered data, both in terms of reduced fragmentation and prevalence of larger fragments. Table 3 shows the average entropy values for each network and for the corresponding control networks where cluster labels have been randomly permuted for all GO categories. To avoid the problems of small sample sizes, only those GO categories that were assigned to at least 10 proteins in the dataset were included. The ALL network has the lowest average entropy, again suggesting that it is better at grouping together related proteins: its average entropy is 2.72, compared with 3.31 for the same network with the GO labels associated with the nodes randomly permuted. All of the observed differences in entropy were found to be highly significant, with none of the permuted networks having an average entropy value as low as that of the real one, indicating a confidence of at least p < 0.0001. This appears to be due to the hierarchical nature of the GO categories, where every wrong assignment with respect to a child term would also lead to penalties incurred at the parent level. Therefore, the density distributions for the permuted networks were very narrow (a plot is included in Additional File 1: additional figures and analysis).
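The comparison against permuted labels can be sketched as below. The sketch assumes that the entropy of a GO term is the Shannon entropy (base 2) of the distribution of its annotated proteins over clusters; the exact definition used in the study is given in the Methods, so the formula, the function names and the toy data here should be read as assumptions.

```python
import math
import random
from collections import Counter

def fragmentation_entropy(proteins, cluster_of):
    """Shannon entropy of how a term's proteins are spread over clusters."""
    counts = Counter(cluster_of[p] for p in proteins)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def average_entropy(term_to_proteins, cluster_of, min_size=10):
    """Mean entropy over GO terms assigned to at least min_size proteins."""
    values = [fragmentation_entropy(ps, cluster_of)
              for ps in term_to_proteins.values() if len(ps) >= min_size]
    return sum(values) / len(values)

def permutation_test(term_to_proteins, cluster_of, n_perm=200, seed=0):
    """Empirical p-value: how often shuffled labels give entropy as low as observed."""
    rng = random.Random(seed)
    observed = average_entropy(term_to_proteins, cluster_of)
    proteins = list(cluster_of)
    labels = [cluster_of[p] for p in proteins]
    as_low = 0
    for _ in range(n_perm):
        rng.shuffle(labels)                  # permute cluster labels across proteins
        permuted = dict(zip(proteins, labels))
        if average_entropy(term_to_proteins, permuted) <= observed:
            as_low += 1
    return observed, (as_low + 1) / (n_perm + 1)

# Hypothetical toy data: 40 proteins in 4 clusters of 10, two invented GO terms.
cluster_of = {f"p{i}": i // 10 for i in range(40)}
term_to_proteins = {
    "GO:A": [f"p{i}" for i in range(10)],                      # one cluster only
    "GO:B": [f"p{i}" for i in list(range(10, 15)) + list(range(20, 25))],
}

observed, p_value = permutation_test(term_to_proteins, cluster_of)
```

The add-one correction in the p-value estimate means that the smallest reportable value is 1/(n_perm + 1), so a bound such as p < 0.0001 requires on the order of ten thousand permutations.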
An example of fragmentation in the ALL relationship network
Figure 4B shows the fragmentation of this cluster by visually separating all the MCL clusters across which this term is distributed. It is evident that the clustering in this case is not able to group together all the nodes that are associated with the general process 'response to hormone stimulus'. In this case there were only two clusters (of size greater than 10) in which most proteins were annotated with the same term (e.g. 'response to auxin stimulus' and 'response to abscisic acid stimulus'). However, even in situations where the grouping is suboptimal, it is still useful to be able to determine how much the grouping differs from the one specified by the annotations and structure of the Gene Ontology.
In order to assess the functional coherence of modules detected by clustering relationship networks combining four commonly used data sources we have looked at the representative functions of these modules with respect to GO categories and at the fragmentation of GO categories with respect to the modules. To investigate the trade-off between coverage and specificity of the representative function of modules, we have defined the AIC-MICA metric. Additionally, two metrics describing the fragmentation of GO categories, namely BFRP and BERP, were introduced to evaluate how well the modular structure recovered by the MCL algorithm corresponds to the BP categories. These metrics look at two key aspects that relate modules in relationship networks to functional annotations. They allow us to compare the usefulness of individual data sources and the effects of combining multiple sources on the coherence of the modules.
We have found that, as expected, the SEQ network was the best for recovering very specific functional association between proteins. This was evident from the high AIC-MICA values across all coverage levels. However, an important point to note is that it may not always be desirable to extract such close groupings, and the higher level categorisation may be helpful to provide a broad overview or to help dissect very large datasets. Compared to other networks, SEQ consisted of a large number of strongly connected components (results not shown) which resulted in the relatively high overall entropy with respect to the whole Gene Ontology. We also observed that the clusters recovered were only related to a small number of GO terms. Another problem with SEQ as a sole data source is that there was insufficient evidence to link most of the proteins in our reference set. By comparison with the SEQ network, it was possible to use the ALL network to assign 721 more proteins to a cluster of size greater than one due to links that were contributed by other sources. Based on these findings, we conclude that overall there is a clear benefit from the integration of additional data sources, although there is a small cost incurred because of a reduction in functional coherence. As the ALL network performs relatively well in terms of AIC-MICA (40-90), this dilution of annotation specificity does not appear to render it uninformative. In fact, the minimum information content value that was applicable at a 40% coverage level was 0.55 and was reached only for 5 clusters found in the ALL network. This value corresponds to the 'cellular physiological process' GO term, which is one of the direct descendants of the 'biological process' root term, and is therefore very general.
To support this work, several different visualisation strategies were developed that help to summarise complex integrated networks and identify high-level patterns in them. Using these visualisation methods, we have identified that there was a hierarchically organised neighbourhood in the integrated network that was composed of the proteins annotated to the "response to hormone stimulus" GO term. This finding indicates there may be more complex and meaningful patterns than just the modules that could be identified using clustering approaches.
Comparison of graph-theoretic properties of the four networks also appears to indicate that the addition of extra edges leads to the creation of a more compact network, with a smaller diameter than the COE or PPI networks. Despite this, the transitivity has remained relatively low, indicating that the number of complete cliques is small. These differences may be interpreted as an indication that, in the ALL network, potential modules are more difficult to recover and the results may be further improved using more robust clustering approaches, like spectral clustering methods . Further investigation of the impact of increasing complexity of the network versus increasing levels of noise that arise from integration of additional data sources is necessary to confirm these trends.
The co-expression (COE) network performed the worst with respect to BFRP, BERP and AIC-MICA. At first glance, this result appears to contradict several earlier studies [12, 13] where many meaningful clusters were identified in the co-expression network, but this discrepancy is likely to be an artefact of the smaller subset of the proteome that was used in this case. In earlier reports using large co-expression networks, the patterns detected tended to be associated with clusters containing more than 1000 proteins [12, 13], which are much larger than any of the modules identified in this study. This may be an indication that co-expression is a weaker source of evidence of functional similarity and more data are necessary in order to be able to make useful inferences.
In this study we have restricted the set of proteins in the network to those for which protein-protein interaction information is available, as this is a currently limiting information source for Arabidopsis. Using a larger set of proteins would have meant that the contribution of the PPI data would have been highly unbalanced in relation to other available information. Although we recognise that there are other species, in particular Saccharomyces cerevisiae, for which there is much more data available, it is also of importance to validate these types of approaches in more complex multicellular model organisms. We have also illustrated that meaningful modules can be successfully identified by clustering the integrated relationship networks even in situations when limited data are available and only part of the complete proteome is considered.
In this work we have addressed a number of important issues pertinent to the identification of functional modules in integrated relationship networks, but it is important to recognise that a number of alternative approaches exist for analysis of such networks. In particular, it is possible to weight the edges in the network based on the confidence in individual evidence types [47, 48, 50–53]. However, both the strategies for selecting optimum weights and the ways they can be meaningfully combined across heterogeneous evidence types remain a subject of ongoing research. Another possibility is to use an alternative clustering approach for the recovery of modules. Historically, the MCL algorithm has often been applied in the context of biological networks because it offers scalable performance even with large datasets and several studies have shown that it can outperform other methods in some cases [54–56]. However, a number of other novel algorithms have now been developed, among them MCODE , MC-UPGMA , CPA , FORCE  and SPICi . A number of these approaches have also been compared in the context of PPI networks in the work by Brohée and van Helden . Further investigation into these alternative approaches has potential for future research, but was outside the scope of the present study.
Module detection in integrated biological and relationship networks is one of the most important tools for interpretation of complex biological datasets. As the amount of biological information continues to grow, it also becomes increasingly important to improve our understanding of inter-relationships within these data and, ultimately, their relationship to biological function. In this paper we have explored and quantified the integration of the several data types that are most commonly used for construction of such networks. For our datasets, we have found that combining several types of evidence was beneficial with respect to the functional annotation of modules detected using the MCL clustering algorithm, which on average corresponded more closely to the functional groupings in the Biological Process aspect of GO. Although the overall level of informativeness of cluster annotation was not as good as in the sequence similarity network, it was possible to link many more proteins using additional information sources. These findings indicate that there is benefit to the integration of additional information sources, as it allows more proteins to be assigned to functional modules with only a relatively small reduction in the module annotation precision. The overall outcomes of this study provide a number of insights into the relationship between integrated networks and protein function and may be of use for further refinement of related approaches that can better capture biologically relevant information from integrated datasets.
We constructed a protein-protein interaction network based on experimentally established protein-protein interaction data from the IntAct database  and combined it with additional data, namely gene co-expression, sequence similarity and information from co-occurrence of protein names in the scientific literature. Previously we have described the approach used for constructing a combined network of PPI and gene co-expression data using Ondex . In this study we investigate the inherent modular structure of these networks and relate it to the underlying biological processes using the Gene Ontology (GO)  and quantify these properties using information content and semantic distance-based measures.
Construction of the integrated relationship network
In the network, nodes represented proteins and an edge was added between two proteins if at least one of the four possible evidence types linked them: co-occurrence of protein names in PubMed abstracts, co-expression of the genes that encode those proteins (where the magnitude of the Pearson correlation coefficient is greater than 0.6), sequence similarity (with E-value < 0.0001) or experimentally determined protein-protein interaction.
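The edge rule described above amounts to taking the union of four thresholded evidence sets while recording which evidence types support each edge. The following Python sketch illustrates this under stated assumptions: it is not the Ondex workflow used in the study, the function and parameter names are hypothetical, the input data are invented, and the thresholds are exposed as parameters set to the values quoted in the preceding paragraph.

```python
def build_integrated_network(ppi_pairs, coexpression, sequence_hits,
                             literature_pairs, min_abs_r=0.6, max_evalue=1e-4):
    """Return {frozenset({a, b}): set of evidence types} for the union network.

    ppi_pairs, literature_pairs: iterables of (protein, protein) pairs
    coexpression:  {(protein, protein): Pearson correlation coefficient}
    sequence_hits: {(protein, protein): E-value}
    """
    evidence = {}

    def add(a, b, kind):
        if a == b:                           # self-interactions are discarded
            return
        evidence.setdefault(frozenset((a, b)), set()).add(kind)

    for a, b in ppi_pairs:
        add(a, b, "PPI")
    for (a, b), r in coexpression.items():
        if abs(r) > min_abs_r:               # magnitude of correlation above threshold
            add(a, b, "COE")
    for (a, b), evalue in sequence_hits.items():
        if evalue < max_evalue:
            add(a, b, "SEQ")
    for a, b in literature_pairs:
        add(a, b, "LIT")
    return evidence

# Hypothetical toy inputs.
ppi = [("P1", "P2"), ("P3", "P3")]
coexpr = {("P1", "P2"): 0.72, ("P2", "P3"): -0.65, ("P1", "P4"): 0.30}
seq = {("P3", "P4"): 1e-20, ("P1", "P3"): 0.5}
lit = [("P2", "P1")]

network = build_integrated_network(ppi, coexpr, seq, lit)
```

Keying edges by an unordered pair means that evidence reported as (P1, P2) by one source and (P2, P1) by another accumulates on the same edge, which is what makes it possible to count how many evidence types support each relationship.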
We have imported protein-protein interaction (PPI) data from the IntAct database (PSI-MI XML format) into the Ondex system and removed all entities that were not annotated with the Arabidopsis thaliana NCBI taxonomy identifier and all entities that were not proteins. Interactions between multiple copies of the same protein were then discarded. All proteins that were not part of any interaction were also removed from the set.
A CO-Expression network (COE) was constructed from Arabidopsis co-expression data from the ATTED-II [63, 64] database. An edge was created in the co-expression network if the absolute value of Pearson's correlation coefficient of respective gene expression profiles was greater than 0.6.
For the literature-based co-occurrence analysis of protein names, we downloaded 30,639 abstracts from PubMed which contained the word "Arabidopsis". This set of publications together with the integrated set of Arabidopsis PPIs were loaded into Ondex. Each protein node contained a complete set of protein names and synonyms provided by TAIR and UNIPROT. The Ondex text mining plug-in was used to create relations between proteins and publications and transform the output to a co-occurrence network . An edge in the protein name co-occurrence network (LIT) indicates that there was at least one abstract that included a mention of both proteins.
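The co-occurrence step can be illustrated with a simple name-matching sketch in Python. It is a stand-in for the Ondex text mining plug-in, whose matching rules are not described here: plain case-insensitive whole-word matching is assumed, and the protein identifiers, names and abstract texts are invented.

```python
import re
from itertools import combinations

def cooccurrence_edges(abstracts, synonyms):
    """Return {(protein_a, protein_b): set of abstract ids mentioning both}.

    abstracts: {abstract_id: text}
    synonyms:  {protein_id: list of names and synonyms}
    """
    patterns = {}
    for protein, names in synonyms.items():
        # Longest names first so multi-word synonyms are tried before short ones.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        patterns[protein] = re.compile(r"\b(?:" + alternatives + r")\b",
                                       re.IGNORECASE)
    edges = {}
    for abstract_id, text in abstracts.items():
        mentioned = sorted(p for p, pat in patterns.items() if pat.search(text))
        for a, b in combinations(mentioned, 2):
            edges.setdefault((a, b), set()).add(abstract_id)
    return edges

# Hypothetical toy data.
synonyms = {
    "protA": ["FOO1", "FOO-LIKE 1"],
    "protB": ["BAR2"],
    "protC": ["BAZ3"],
}
abstracts = {
    "a1": "We show that FOO1 and BAR2 are expressed in Arabidopsis roots.",
    "a2": "BAZ3 was characterised in Arabidopsis seedlings.",
    "a3": "Loss of foo-like 1 alters BAR2 accumulation in Arabidopsis.",
}

lit_edges = cooccurrence_edges(abstracts, synonyms)
```

Keeping the supporting abstract identifiers on each edge, rather than a bare yes or no, would also allow a count-based threshold to be applied later, although the LIT network described above requires only a single shared abstract.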
Sequence similarity was determined by using TimeLogic® Tera-BLAST™ (Active Motif Inc., Carlsbad, CA) for all-against-all sequence-comparison of proteins in the interaction dataset, with an E-value cut-off at 10⁻³ and minimum percent sequence identity cut-off at 25%. One edge was created in a sequence-similarity network (SEQ) per pair of proteins with similar sequences.
Gene Ontology annotation
To explore the functional groupings of proteins in the network, we have combined all available Arabidopsis GO annotations from three sources: IntAct , GOA-EBI  and UNIPROT . We have calculated the Information Content (IC)  of the annotations using the combined set of all GO annotations of the Arabidopsis proteome subset as identified in the UNIPROT database. All annotations to proteins not included in the proteome set were discarded prior to calculation of the IC. The combined network of different evidence types and GO annotation is included in the additional material (Additional File 2: Integrated network).
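As an illustration of how an information content value can be derived from an annotation set, the sketch below uses the common formulation IC(t) = -log p(t), where p(t) is the fraction of proteins annotated to term t or to any of its descendants. The choice of logarithm base (2 here), the per-protein frequency and the toy ontology are assumptions for illustration and may differ from the calculation used in the study.

```python
import math

def information_content(annotations, ancestors):
    """Return {term: -log2(fraction of proteins annotated to the term)}.

    annotations: {protein: set of directly assigned GO terms}
    ancestors:   {term: set of all its ancestor terms}
    """
    n = len(annotations)
    counts = {}
    for terms in annotations.values():
        closed = set(terms)
        for t in terms:                      # a protein inherits all parent terms
            closed |= ancestors.get(t, set())
        for t in closed:
            counts[t] = counts.get(t, 0) + 1
    return {t: -math.log2(c / n) for t, c in counts.items()}

# Hypothetical toy ontology and annotations.
ancestors = {
    "BP": set(),
    "stimulus": {"BP"},
    "hormone": {"stimulus", "BP"},
    "auxin": {"hormone", "stimulus", "BP"},
    "metabolic": {"BP"},
}
annotations = {
    "p1": {"auxin"},
    "p2": {"hormone"},
    "p3": {"stimulus"},
    "p4": {"metabolic"},
}

ic = information_content(annotations, ancestors)
```

Because every protein inherits the root term, the root always has an information content of zero, and values increase monotonically along any path from the root towards more specific terms.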
Clustering the relationship networks
We explored the natural groupings of the proteins (nodes) using the MCL clustering algorithm . This algorithm simulates flow in the network and can be used to identify strongly connected groups of nodes in the network. We have used an implementation of the MCL algorithm (v10-148) from http://www.micans.org/mcl/, which was wrapped as a plug-in and made accessible from the Ondex data integration platform. The inflation coefficient (I) determines the granularity of the clusters produced by the algorithm. A value of I = 2.8 was used for all of the clustering analysis described in this paper.
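To make the roles of flow simulation and the inflation coefficient concrete, the following is a deliberately small, dense-matrix Python sketch of the Markov clustering idea: alternate expansion (squaring the column-stochastic matrix) with inflation (raising entries to the power I and renormalising columns) until the matrix stops changing. It is not the micans implementation used in the study, omits the pruning and sparse-matrix handling needed for real datasets, and uses hypothetical function names and a toy graph.

```python
def normalise_columns(matrix):
    """Scale each column so that it sums to one."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def markov_cluster(n, edges, inflation=2.8, max_iter=100, tol=1e-9):
    """Cluster nodes 0..n-1 of an undirected graph; returns sorted member lists."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][i] = 1.0                        # self-loops, a common MCL convention
    for a, b in edges:
        m[a][b] = m[b][a] = 1.0
    m = normalise_columns(m)
    for _ in range(max_iter):
        expanded = matmul(m, m)              # expansion: flow spreads along paths
        inflated = normalise_columns(
            [[x ** inflation for x in row] for row in expanded])  # inflation
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    clusters = set()
    for row in m:                            # each non-empty row defines a cluster
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Toy graph: a triangle (0, 1, 2) and a separate pair (3, 4).
clusters = markov_cluster(5, [(0, 1), (1, 2), (0, 2), (3, 4)])
```

Larger inflation values strengthen strong flows relative to weak ones at each step and therefore tend to produce more, smaller clusters, which is why a single fixed value was applied to all networks so that their clusterings remain comparable.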
Assessing the functional coherence of modules
Our aim was to assess the functional coherence of modules by exploring two aspects: (i) whether the clusters contain proteins that are generally similar in terms of their functions, as assigned by Gene Ontology terms, i.e. the most representative GO terms in a cluster, and (ii) the way in which proteins with the same functional roles are distributed across different clusters, i.e. the fragmentation of the GO terms.
with k ∈ C_t.
In order to compare the number of fragments and the entropy of fragmentation according to the source of relationship data, we have ranked both of them for each of the GO terms across all five networks. The number of times each of the data sources was assigned the best rank (i.e. the lowest value) was counted and a proportion with respect to the total number of GO categories was calculated. For the sake of brevity, from here onwards we use the abbreviations BFRP (best fragment rank proportion) and BERP (best entropy rank proportion) when referring to these comparative measures. These measures provide an intuitive method to compare the networks, as the output can be understood as the percentage of cases where a particular network performed best or at least as well as any of the others.
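The ranking procedure behind BFRP and BERP can be sketched as follows. The same function serves for both measures, depending on whether it is given fragment counts or entropy values per GO term. Awarding the best rank to every network that ties for the lowest value is an assumption made here, and the network values are invented for illustration.

```python
def best_rank_proportion(values_by_network):
    """Proportion of GO terms for which each network has the lowest value.

    values_by_network: {network: {go_term: value}}, lower values being better.
    Networks that tie for the lowest value all receive the best rank.
    """
    networks = list(values_by_network)
    terms = set.intersection(*(set(v) for v in values_by_network.values()))
    wins = {net: 0 for net in networks}
    for term in terms:
        best = min(values_by_network[net][term] for net in networks)
        for net in networks:
            if values_by_network[net][term] == best:
                wins[net] += 1
    return {net: wins[net] / len(terms) for net in networks}

# Hypothetical numbers of fragments per GO term in three networks.
fragments = {
    "ALL": {"t1": 2, "t2": 3, "t3": 1, "t4": 4},
    "SEQ": {"t1": 5, "t2": 3, "t3": 2, "t4": 6},
    "COE": {"t1": 6, "t2": 7, "t3": 4, "t4": 3},
}

bfrp = best_rank_proportion(fragments)
```

Because ties are credited to every tied network, the proportions across networks can sum to more than one, which is consistent with reading each value as the share of GO terms on which a network was not outperformed.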
The integration process was implemented as a set of workflows in the Ondex integrator. The resulting network was visualized and further analyzed interactively using the Ondex front-end. Ondex contains a command console that supports a variety of common scripting languages and allows the use of external libraries to facilitate the analysis, add annotation to nodes and edges, and visualize the results. To carry out the analysis for this paper, we chose Jython in order to use the analysis capabilities of the NetworkX v0.99 Python graph analysis library. To enable the exchange of data between Ondex and NetworkX, we implemented a method that exports a pre-defined subset of an integrated network to a NetworkX representation. The results returned were then added as additional annotation to the graph using methods in the Ondex Jython scripting plug-in. Interactive visual exploration of the network used the visualization methods available in Ondex, which include the ability to set the visibility, size/width and colour of nodes and edges based on the numerical values of their attributes and/or group membership.
The authors gratefully acknowledge funding from the UK Biotechnology and Biological Sciences Research Council (BBSRC). AL was supported by a PhD studentship BBS/S/E/2006/13205. KHP and JT were supported by the Ondex BBSRC SABR Grant BB/F006039/1, which also partially supported MS and CJR. Rothamsted Research receives grant-in-aid from the BBSRC, which supported MDP, MS and CJR.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.