Quantification of protein group coherence and pathway assignment using functional association
© Chitale et al; licensee BioMed Central Ltd. 2011
Received: 16 February 2011
Accepted: 19 September 2011
Published: 19 September 2011
Genomics and proteomics experiments produce large amounts of data that await functional elucidation. An important step in analyzing such data is to identify functional units, which consist of proteins that play coherent roles to carry out a function. Importantly, functional coherence is not identical to functional similarity. For example, proteins in the same pathway may not share the same Gene Ontology (GO) terms, but they work in a coordinated fashion so that the intended function can be performed. Thus, simply applying existing functional similarity measures might not be the best solution to identify functional units in omics data.
We have designed two scores for quantifying functional coherence by considering the association of GO terms observed in two biological contexts: co-occurrences in protein annotations and co-mentions in literature in the PubMed database. The counted co-occurrences of GO terms were normalized in a manner similar to the way statistical amino acid contact potentials are computed in the protein structure prediction field. We demonstrate that the developed scores can identify functionally coherent protein sets, i.e. proteins in the same pathways, co-localized proteins, and protein complexes, with statistically significant score values, showing better accuracy than existing functional similarity scores. The scores are also capable of detecting protein pairs that interact with each other. It is further shown that the functional coherence scores can accurately assign proteins to their respective pathways.
We have developed two scores which quantify the functional coherence of sets of proteins. The scores reflect the actual associations of GO terms observed either in protein annotations or in literature. It has been shown that they have the ability to accurately distinguish biologically relevant groups of proteins from random ones as well as a good discriminative power for detecting interacting pairs of proteins. The scores were further successfully applied for assigning proteins to pathways.
Elucidating the role of proteins is a central problem in molecular biology. Computational methods play indispensable roles in various aspects of the functional elucidation of proteins, including database searches [1, 2], capturing motifs and features in sequences [3–7], structures [8–10], and experimental data, as well as clustering of proteins by functional similarity. The importance and expectations of computational methods are further highlighted in systems biology, where a flood of sequenced genomes and various types of omics data are awaiting functional elucidation [13–18].
Given the weaknesses of conventional homology search methods, e.g. limited coverage in genome annotations and the need for homologous proteins [17–20], various new approaches for function prediction have been developed in the past decade. These include methods which use sequence information in an elaborate fashion [21–27], those which compare global and local tertiary structure information, and methods which use large-scale experimental data of proteins [11, 28–35].
Besides function prediction, computational methods are also required for the interpretation of large-scale experimental data in the biological context. Omics data, such as protein-protein interaction networks [36–40], microarray gene expression data [41, 42], and expression data obtained by mass spectrometry or by RNAseq [44, 45], provide a rich source of information for systems-level understanding of the protein interplay. Clustering genes by functional similarity is an indispensable step in finding the underlying biological principles behind the observed data.
To enable the above-mentioned computational function analyses, it is necessary to establish a measure that quantifies functional associations between proteins. Controlled vocabularies of annotation terms, such as the Gene Ontology (GO), provide a convenient platform for handling text descriptions of the roles of gene products (RNA and protein). GO classifies annotation terms into three domains: Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Terms in each domain are organized hierarchically as a Directed Acyclic Graph (DAG). The similarity between a pair of GO terms or, more generally, between two sets of GO terms can be defined in several different ways. Most simply, two sets of GO terms can be compared by head-to-head matching, where the similarity is determined by the number of annotations common to both sets. Based on the GO hierarchy, the similarity of two GO terms can be defined as the minimum path length between them on the GO DAG [47, 48]. A better alternative to the minimum path is to consider the Lowest Common Ancestor (LCA) of a pair of GO terms in the hierarchy, for which the Information Content (IC) is computed [49–51]. Schlicker et al. have developed a score named funsim, which combines the similarity of GO terms in the BP and MF domains based on the IC of the LCA. In the Methods section we discuss their scoring scheme in detail.
Pairwise functional similarity may be suitable for certain purposes, e.g. for evaluating the accuracy of function prediction or for investigating functional similarity between a particular protein and others (e.g. homologous proteins). However, the situation can be different in omics data analyses, where many genes rather than a pair need to be handled to identify the set of gene products that work in a functionally coherent fashion. Functional coherence is exhibited in biologically relevant protein sets, for example, proteins in the same biological pathways, subcellular localizations, or protein complexes, and proteins involved in the same stage of development or in the same disease. Importantly, proteins in a functionally coherent set may not necessarily have the same or similar GO terms in all three GO domains, but their GO terms should be coherent with respect to each other so that the intended function can be performed in a coordinated fashion. As an illustration, consider proteins in the same KEGG pathway. These proteins have different MF annotations because they carry out different enzymatic reactions. Moreover, interestingly, in general they also do not necessarily share a common BP annotation. For example, the pyruvate metabolism pathway (KEGG pathway ID: 00620) has 33 proteins, which are annotated with 48 unique BP domain terms. Only 8 of these proteins are annotated with pyruvate metabolic process (GO:0006090), and each of the remaining 47 GO BP terms is assigned to fewer proteins. The data for all 101 KEGG pathways of yeast have been made available as Additional File 1. This lack of a shared BP annotation can have several causes. One is that the classification of the whole metabolic pathway into sub-pathways may differ from database to database. For example, the KEGG pathway database is not constructed by referring to the Gene Ontology annotations of genes. Another is that proteins are sometimes annotated with a BP term at a different specificity (child/parent terms). And of course the incompleteness of GO annotation could be another reason. Thus, even if all the BP domain annotations for the set of proteins are known, it would not be trivial to decide whether the set is coherent by simply applying the existing pairwise functional similarity measures.
Only a handful of previous works have assessed functional coherence. One type of related work considers GO terms that are enriched in a protein group [30, 34, 35, 53, 54]. However, it has been argued that statistically significant enrichment of certain GO terms, evaluated using the hypergeometric distribution, often does not indicate functional units in biological pathways. Recently, Chagoyen et al. treated BP annotations of proteins as a vector of GO terms and computed pairwise protein similarity using the cosine distance. They compute the overall homogeneity of a set by averaging all the pairwise similarities between proteins in the set, and further assess the statistical significance of the coherence score. Pandey et al. have extended the concept of pairwise common ancestors of GO terms to the set of most specific common ancestors of the annotation sets of two proteins [57, 58]. They have studied this functional coherence measure in the context of topological proximity of proteins in PPI and domain-domain interaction networks. Zheng et al. performed text mining on research papers in the MEDLINE database to represent the semantic content of a document in terms of the presence of topics in the document. The documents are associated with proteins, which provides the protein-topic association as a graph. The closeness of proteins on this graph is then used to determine the functional coherence of a group of proteins.
In this work, we propose two association scores for GO terms, which are aimed at evaluating the functional coherence of sets of proteins. The proposed scores quantify the associations of GO terms as the frequency of co-occurrence in two different biological contexts: in the GOA protein sequence annotations and in the PubMed database literature. The former score is named the Co-occurrence Association Score (CAS) while the latter is named the PubMed Association Score (PAS). We quantify the GO term associations by applying a method used for computing the knowledge-based statistical potentials for amino acid contacts [61, 62], which is widely used in protein structure prediction. Unlike existing works which define similarity based on the GO hierarchy, our scores directly reflect how well terms are associated in the actual biological context. Since the associations are not restricted to the GO hierarchy, we can quantify association between terms across different GO domains. The novel and advantageous characteristic of our scores is that they quantify functional coherence and not necessarily similarity. The GO database has recently introduced relationships between the Molecular Function (MF) and Biological Process (BP) domains to represent biological knowledge about the pathways and roles of genes. Compared with this recent effort, our approach is more general, flexible, and automatic in the sense that the considered associations include knowledge from within the GO hierarchy as well as outside its structure. The resulting GO term associations reflect the current actual annotations in the databases.
We demonstrate that the developed association scores can identify functionally coherent protein sets, i.e. proteins in the same KEGG pathways, cellular locations, and protein complexes, better than the above-mentioned existing methods. In addition, we show that these functional coherence scores can accurately assign proteins to the KEGG pathways to which they belong. The current approach can easily be applied to other biological data sources to mine associations, and to other ontologies as well, since it does not assume any underlying structure in the ontology.
CAS and PAS coherence scores
We have developed two function association scores, the CAS and the PAS. The CAS quantifies the frequency of GO terms that co-occur in the gene annotations, while the PAS takes into account co-occurrence of GO terms in the PubMed abstracts. The Gene Ontology database used in this study contains 17,316 Biological Process (BP), 2,534 Cellular Component (CC), and 9,428 Molecular Function (MF) domain terms, which result in a total of 29,278 terms. Among the 857,201,284 (29,278 × 29,278) possible GO term pairs, 5,610,201 pairs (0.654%) obtained a non-zero value for the CAS while 3,320,265 pairs (0.387%) had a non-zero PAS.
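The counts in the paragraph above can be re-derived with a few lines of arithmetic. The sketch below only reproduces the reported numbers; it assumes that "possible pairs" means every combination of two terms including order and self-pairs (the square of the term count), which is the reading that matches the reported total. The names `domain_counts` and `percent_nonzero` are illustrative and do not come from the original work.

```python
# Term counts per GO domain, as reported in the text.
domain_counts = {"BP": 17_316, "CC": 2_534, "MF": 9_428}
total_terms = sum(domain_counts.values())

# Assumption: "possible pairs" is the square of the term count
# (ordered pairs, self-pairs included), which matches the reported total.
possible_pairs = total_terms ** 2


def percent_nonzero(nonzero_pairs: int, all_pairs: int) -> float:
    """Percentage of term pairs that received a non-zero score."""
    return 100.0 * nonzero_pairs / all_pairs


cas_percent = percent_nonzero(5_610_201, possible_pairs)
pas_percent = percent_nonzero(3_320_265, possible_pairs)

print(total_terms, possible_pairs)
print(f"CAS: {cas_percent:.3f}%  PAS: {pas_percent:.3f}%")
```

Both scores are therefore defined for well under 1% of the pair space, so any table of scores is best stored sparsely.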
Examples of cross-domain GO term pairs which have a high CAS or PAS
GO ID 1
GO ID 2
Photinus-luciferin 4-monooxygenase activity
Chloroplast ATP synthase complex
photosynthetic electron transport in photosystem II
RNA strand annealing activity
Group II intron splicing
Glycoprotein transporter activity
negative regulation of cholesterol storage
Gram-negative bacterial cell surface binding
alanine-oxo-acid transaminase activity
L-alanine catabolic process, by transamination
host cell surface binding
viral envelope fusion with host membrane
3-cyanoalanine hydratase activity
cyanide metabolic process
nucleoside-triphosphate diphosphatase activity
pyrimidine nucleoside triphosphate catabolic process
peroxisome membrane targeting sequence binding
protein import into peroxisome membrane
Cell leading edge
regulation of microtubule cytoskeleton organization
meiotic cohesin complex
sister chromatid segregation
structure-specific DNA binding
DNA replication, Okazaki fragment processing
3'-tRNA processing endoribonuclease activity
tRNA 3'-trailer cleavage, endonucleolytic
nicotinamide riboside kinase activity
NAD biosynthesis via nicotinamide riboside salvage pathway
prenylcysteine oxidase activity
prenylcysteine catabolic process
cystathionine beta-lyase activity
methionine biosynthetic process from L-homoserine via cystathionine
nicotinamide riboside hydrolase activity
NAD biosynthesis via nicotinamide riboside salvage pathway
The latter ten examples are cases where the PAS is higher than the CAS. Since the PAS takes lower values than the CAS (Figure 3), the difference between the PAS and the CAS in these examples is more significant than it seems from the absolute score values. The first of these, TRAMP polyadenylation complex (e.g. UniProt AC: Q9P795), is involved in post-transcriptional quality control mechanisms, including RNA surveillance and degradation of a wide range of nuclear RNAs, including some non-protein-coding RNA transcripts (ncRNAs), by stimulating the 3' to 5' exonuclease activity of the exosome. The second example concerns the BRCA1-A complex (e.g. UniProt AC: Q9NWV8), which binds to the K63-linked polyubiquitin chains present on histones at DNA damage sites and may facilitate the deubiquitinating activity of the deubiquitination enzyme BRCC36. The third GO pair is mined from the literature which reports the role of microtubules and actin filament networks in directed cell migration. The cell leading edge refers to the area of a motile cell closest to the direction of motion, which clearly requires actin filaments and microtubules for the movement. The next GO pair captures the information about sister chromatid cohesion during meiotic differentiation, which is mediated by a cohesin complex. The fifth example concerns the calf 5' to 3' exo/endonuclease (the human counterpart of which is flap endonuclease-1) (e.g. UniProt AC: P39748), which is involved in the structure-specific cleavage of DNA and processes Okazaki fragments during DNA replication. The last five examples provide the missing links between MF and BP terms based on high PAS values; for example, the MF term GO:0001735 prenylcysteine oxidase activity is frequently mentioned in literature discussing a protein that plays a role in GO:0030328 prenylcysteine catabolic process.
Examples of concurrent GO terms based on CAS
Concurrent GO terms
glutamine catabolic process
glutamine metabolic process
vitamin B6 biosynthetic process
glutamine family amino acid catabolic process
vitamin B6 metabolic process
maltose catabolic process
glycogen debranching enzyme activity
maltose metabolic process
deoxyribodipyrimidine photo-lyase activity
DNA photolyase activity
pyrimidine dimer repair
DNA strand renaturation
bubble DNA binding
double-strand break repair via single-strand annealing
DNA strand annealing activity
ATP-dependent 3'-5' DNA helicase activity
DNA secondary structure binding
nucleotide-excision repair factor 2 complex
nucleotide-excision repair factor 4 complex
nucleotide-excision repair, DNA damage recognition
Cul3-RING ubiquitin ligase complex
nucleotide-excision repair complex
Examples of concurrent GO terms based on PAS
Concurrent GO terms
lactose synthase activity
N-acetyllactosamine synthase activity
beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity
oligosaccharide biosynthetic process
ubiquitin-protein ligase activity
small conjugating protein ligase activity
regulation of ubiquitin-protein ligase activity
vasopressin activated calcium mobilizing receptor activity
ubiquitin-ubiquitin ligase activity
ISG15 ligase activity
iron ion transmembrane transport
regulation of iron ion transmembrane transport
iron ion transmembrane transporter activity
cobalt ion transmembrane transporter activity
cadmium ion transmembrane transport
pyridine nucleoside metabolic process
pyridine nucleoside catabolic process
NAD biosynthesis via nicotinamide riboside salvage pathway
nicotinamide riboside kinase activity
nicotinamide riboside catabolic process
nicotinamide riboside metabolic process
hemoglobin alpha binding
hemoglobin beta binding
hemoglobin metabolic process
QuickGO, a recently built Gene Ontology browser, also provides functionality to browse co-occurring GO terms. This is similar to what the CAS captures, but there are notable differences due to their different purposes. As the primary purpose of QuickGO is to browse the GO easily, it shows co-occurring GO terms for a specific query GO term. The score (named the S% score) used to sort the co-occurring terms for a specified GO term is directional (i.e. the score for A to B and for B to A can be different). In contrast, the CAS is not directional, as it is designed for identifying biologically coherent protein groups by capturing the GO term association. Moreover, the CAS also considers the associations of parental GO terms to capture more associations. And, of course, the PAS is entirely different because it captures co-mentions in PubMed abstracts.
To summarize, the CAS and the PAS have moderate correlation with an existing score, funsim. The CAS and the PAS capture associations within the same domain as well as relationships between cross-domain GO terms, unlike funsim, which only defines the similarity between pairs of GO terms from the same domain. Notably, the CAS and the PAS capture many biologically relevant cross-domain GO term associations (such as the MF-BP and BP-CC examples in Table 1) and thus can be used to obtain missing process-function links between GO terms as well as to find concurrent annotations across all three GO domains.
Coherence scores computed for biologically related protein sets
In addition to the CAS and the PAS coherence scores developed here, we have also used three existing functional similarity scores: the modified funsim score, a score proposed by Chagoyen et al. (termed the Chagoyen score), and a score by Pandey et al. [57, 58] (the Pandey score). Briefly, the Chagoyen score computes the dot product of the information content of BP terms of proteins, while the Pandey score considers the fraction of proteins in the database which are annotated by a common GO ancestor set of the two proteins in question. An example of the most specific pairwise common ancestor of the terms GO:0001948 glycoprotein binding and GO:0030492 hemoglobin binding is their deepest shared GO ancestor term, GO:0005515 protein binding. See Methods for the derivation of the Chagoyen and the Pandey scores. For all five scores, the coherence of a set of proteins is defined as the average of the scores for all the pairs of proteins.
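The set-level definition in the last sentence above (the average of the scores over all protein pairs) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `set_coherence` and `pair_score` are hypothetical names, the pairwise score is passed in as a black box, and the sketch assumes that "all the pairs" means unordered pairs of distinct proteins.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def set_coherence(
    proteins: Sequence[str],
    pair_score: Callable[[str, str], float],
) -> float:
    """Average a pairwise score over all unordered pairs of distinct proteins.

    Assumption: self-pairs are excluded and each pair is counted once.
    """
    pairs = list(combinations(proteins, 2))
    if not pairs:
        raise ValueError("need at least two proteins to form a pair")
    return mean(pair_score(a, b) for a, b in pairs)


# Toy pairwise scores (made-up values, for illustration only).
toy_scores = {
    frozenset({"A", "B"}): 1.0,
    frozenset({"A", "C"}): 0.5,
    frozenset({"B", "C"}): 0.0,
}


def toy_pair_score(a: str, b: str) -> float:
    return toy_scores[frozenset({a, b})]


print(set_coherence(["A", "B", "C"], toy_pair_score))
```

Because the pairwise score is a parameter, the same routine serves any of the five scores compared in this section.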
Before analyzing the protein datasets in Figure 5, we examined the dependence of the five scores on the size of the protein sets (Additional File 2: Figure S1). The verification was performed using 500 random yeast protein sets of sizes varying from 10 to 100 in steps of 10. Since Figure S1 in Additional File 2 shows that the average scores do not change significantly with set size for any of the five scores, we concluded that there is no need to normalize the scores by the size of the protein sets. To evaluate the statistical significance of the scores, we compute the p-value for all the coherence scores. The p-value assesses the number of proteins in the set that have a significantly higher coherence score than expected by chance (see Methods).
Similar trends were observed for the protein complex sets (Figure 6B) and the GOcc sets (Figure 6C). For both datasets, three scores (CAS, PAS, and Chagoyen) showed significantly better performance than the Pandey and funsim scores. For the protein complex sets (Figure 6B), the CAS, PAS, Chagoyen, Pandey, and funsim scores recognized 76.25%, 77.0%, 69.25%, 44.25%, and 3.25% of the protein sets, respectively, at the p-value cutoff of 0.05. In the case of the GOcc sets (Figure 6C), 99.79%, 99.16%, 95.42%, 67.35%, and 7.27% of the sets are recognized by the CAS, PAS, Chagoyen, Pandey, and funsim scores, respectively. Figure 6D shows that the five scores do not assign a significant p-value (0.05 or lower) to most of the randomly generated protein sets. Overall, the CAS and the PAS showed better discriminative performance in identifying the functionally related protein sets than the three existing scores compared.
Coherence scores excluding obvious GO domain
Proteins in the same KEGG pathways are likely to share similar GO terms in the BP domain (child/parent terms) that describe the same biological process. Likewise, proteins in the same group in the GOcc dataset have the same CC term by design. Here we reevaluate the CAS and the PAS coherence scores for the KEGG pathway dataset and the GOcc dataset by excluding the apparently related GO domain. Note that the other three scores compared in Figure 6 also integrate BP and/or CC terms: the funsim score combines GO terms from all three domains, while the Pandey score uses BP and MF terms. The Chagoyen score only evaluates terms in the BP domain. However, we did not examine the effect of removing BP or CC terms from these three scores, because the funsim and the Pandey scores performed significantly more poorly than the PAS and the CAS (Figure 6) and removing BP or CC terms would simply further deteriorate the results. As for the Chagoyen score, it cannot be defined without BP terms.
Detecting protein-protein interactions
Next, we test the proposed functional coherence scores on the protein-protein interaction (PPI) networks of yeast and human. We examine whether the scores are able to detect the interacting proteins (true positives) as opposed to the non-interacting protein pairs (true negatives). The yeast PPI network contains 72,053 interacting protein pairs, while 33,099 interactions are included in the human PPI data (see Methods). The same number of non-interacting protein pairs as interacting protein pairs is extracted from the proteins included in the PPI networks. The p-value for pairs of proteins is computed for the CAS (Eqn. 3), the PAS (Eqn. 4), the funsim (Eqn. 11), the Chagoyen (Eqn. 17), and the Pandey (Eqn. 21) scores, and the pairs are sorted in ascending order of the p-value. Then we computed the Receiver Operating Characteristic (ROC) curve for each score on the yeast and the human PPI datasets.
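The evaluation described above (sort pairs by ascending p-value, then trace a ROC curve) can be sketched with the standard library alone. This is an illustrative sketch, not the code used in the study: the function names `roc_points` and `auc` are hypothetical, a lower p-value is taken to mean "predicted interacting", and pairs tied at the same p-value are assumed to be admitted together.

```python
from typing import List, Sequence, Tuple


def roc_points(
    p_values: Sequence[float],
    is_interacting: Sequence[bool],
) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    The p-value cutoff is swept upward, so the most significant pairs
    are called "interacting" first.
    """
    n_pos = sum(is_interacting)
    n_neg = len(is_interacting) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both interacting and non-interacting pairs")

    ranked = sorted(zip(p_values, is_interacting))  # ascending p-value
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Consume every pair tied at this p-value before emitting a point.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data (made-up p-values): interacting pairs mostly rank first.
toy_p = [0.1, 0.2, 0.3, 0.4]
toy_labels = [True, False, True, False]
print(auc(roc_points(toy_p, toy_labels)))
```

Using equal numbers of interacting and non-interacting pairs, as in the text, keeps both axes of the curve on comparable sample sizes.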
KEGG pathway assignment
Finally, we used the functional coherence scores to predict the most likely KEGG pathway in which a protein plays a role. For a query protein, the coherence score is computed against each KEGG pathway, and the pathways are then sorted and ranked based on the coherence score. We examined whether the correct pathway is scored at the top ranks. For this experiment, the KEGG pathway dataset, which contains 101 pathways, was used, and the cumulative percentages of proteins that are assigned correctly to their pathway were computed. Eight scores were compared. In addition to the CAS_coherence (Eqn. 7), PAS_coherence (Eqn. 8), funsim_coherence (Eqn. 14), GOscore_coherenceBP (Eqn. 15), Chagoyen_coherence (Eqn. 19), and Pandey_coherence (Eqn. 24) scores, the CAS and the PAS were also computed without the BP annotations, denoted CAS(BP-) and PAS(BP-). This is to remove the potentially obvious pathway information encoded in the BP terms (i.e. proteins in the same KEGG pathway share the same BP terms in many cases). As for the funsim score, we have also used only BP annotations, which is referred to as GOscore_coherenceBP, because the funsim score did not perform well in the previous experiments in Figures 6 and 10.
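The ranking procedure above can be sketched as follows. The exact coherence equations (Eqns. 7 to 24) are given in the Methods and are not reproduced here; this sketch assumes, purely for illustration, that the coherence of a query against a pathway is the mean pairwise score between the query and the other pathway members. The names `rank_pathways` and `rank_of_correct`, the pathway labels, and the toy score are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Set


def rank_pathways(
    query: str,
    pathways: Dict[str, Set[str]],
    pair_score: Callable[[str, str], float],
) -> List[str]:
    """Rank pathways for a query protein, best first.

    Assumption: query-to-pathway coherence is the mean pairwise score
    against the pathway members, leaving out the query itself.
    """
    scored = []
    for name, members in pathways.items():
        others = sorted(members - {query})
        if not others:
            continue  # nothing to compare the query against
        scored.append((mean(pair_score(query, m) for m in others), name))
    # Highest score first; ties broken by pathway name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


def rank_of_correct(
    query: str,
    correct: str,
    pathways: Dict[str, Set[str]],
    pair_score: Callable[[str, str], float],
) -> int:
    """1-based rank at which the known pathway of the query appears."""
    return rank_pathways(query, pathways, pair_score).index(correct) + 1


# Toy example: proteins in the same toy group score high with each other.
toy_group = {"P1": 0, "P2": 0, "P3": 0, "P4": 1, "P5": 1}


def toy_score(a: str, b: str) -> float:
    return 1.0 if toy_group[a] == toy_group[b] else 0.1


toy_pathways = {"pathway_x": {"P1", "P2", "P3"}, "pathway_y": {"P4", "P5"}}
print(rank_pathways("P4", toy_pathways, toy_score))
```

Counting how many queries reach `rank_of_correct(...) <= k` for increasing `k` gives the cumulative percentages reported for this experiment.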
We have developed and critically analyzed coherence measures for a set of proteins, which can distinguish biologically relevant sets from random ones. By moving away from conventional methods, which rely on the hierarchical structure of the GO terms, we have designed a novel technique that can incorporate knowledge about the GO terms to find the strength of their association. The scores are computed based on the observed associations of the GO terms. The first score, the Co-occurrence Association Score (CAS), considers the frequency with which pairs of GO terms have been annotated to the same proteins. The PubMed Association Score (PAS), on the other hand, quantifies how often GO term pairs appear together in literature abstracts as compared to random chance. While the most common form of relationship defined by the GO is between terms of the same domain (the is a relation), where one term is a more specific representation of the other, there are some new relationships which connect MF-BP terms (the part of and regulates relations). By using the CAS and the PAS we can automatically find the strength of associations between terms from any two domains of GO, such as MF-BP, BP-CC, or CC-MF, and these associations are not restricted to the relationships provided by the GO hierarchy. About 36% of the CAS and the PAS associations are for cross-domain GO term pairs, and their scores are comparable to those of same-domain terms (Figures 1 & 2). The CAS and the PAS capture different aspects of GO associations. While the CAS focuses on molecular-level relationships of functional descriptions, the PAS often reveals the background knowledge of biologists.
To investigate the characteristics of the CAS and the PAS, we evaluated the two scores on three biologically coherent datasets, namely, the proteins in the same KEGG pathways, proteins that physically interact, and proteins which co-localize in a cell. The CAS and the PAS identified proteins in the same KEGG pathways, complexes, and co-localization with statistically significant scores (Figure 6) and were able to distinguish proteins which physically interact from those which do not (Figure 10). Moreover, the CAS and the PAS correctly assigned about 80-94% of proteins to the KEGG pathways they belong to within the top ten ranks. To the best of our knowledge, this is the first attempt to assign proteins to the KEGG pathways by evaluating the functional coherence. The performance of the CAS and the PAS was superior to the other related existing scores compared.
Counting associations in data is simple yet very powerful in revealing hidden rules behind observed phenomena. Advanced techniques for considering data associations have been studied in the data mining and machine learning fields, and are applied, for example, in marketing [77–79]. Instead of the rather straightforward way of counting associations, using advanced methods, such as a measure of interestingness of association rules and relational rule learning, may further improve the performance of the coherence scores. Specifically, the PAS may be further refined by applying text mining techniques that analyze the grammatical structure of sentences and relationships between phrases [82, 83]. Furthermore, it will also be interesting to apply the same technique to evaluate GO term co-occurrence in different biological contexts, such as gene expression data, regulatory pathways, and PPI networks directly.
In this work we showed that the CAS and the PAS can identify biologically coherent proteins by capturing the GO term associations. The PAS and the CAS will also be useful for predicting the biological function of un-annotated genes. Indeed, there are previous works which use GO term associations for predicting gene function. King et al. used co-occurring GO terms for predicting gene function by modeling relationships of GO terms with decision trees and Bayesian networks. Our group has developed a gene function prediction method, named PFP [22, 23], which considers the GO term associations observed in a database in a similar way to the CAS. PFP first retrieves sequences similar to a query from a sequence database using PSI-BLAST, then extracts GO terms which directly annotate the retrieved sequences as well as GO terms strongly associated with the GO annotations of the retrieved sequences. GO associations are described as conditional probabilities. The extracted GO terms are finally scored according to the frequency of their occurrence in the retrieved sequences and the E-values of the sequences. PFP achieved significantly higher prediction accuracy as compared with a naive way of using PSI-BLAST and some existing methods. Moreover, we can first predict the GO terms for un-annotated proteins by PFP and then apply the PAS/CAS to identify the biological context in which the proteins play a role.
An ultimate goal of biological studies is to understand the underlying structures and relationships of the biological entities which realize the observed phenomena. Such systematic understanding is accompanied by the construction of networks of relationships among the terms in vocabularies that describe and label the biological entities. We believe that this work provides a pivotal step towards systematic understanding and description of the functions and mechanisms of proteins, organelles, cells, and higher-level structures of life.
Two function coherence scores were developed, one which reflects the co-occurrence of GO terms in protein annotations (CAS) and one which considers co-mentions of terms in the literature (PAS). The CAS and the PAS are shown to have the ability to accurately separate biologically relevant groups of proteins, i.e. proteins in the same pathways, protein complexes, and those with the same localization, from random sets. It was also shown that the CAS and the PAS can be used to detect physically interacting protein pairs. The scores were further successfully applied for assigning proteins to the KEGG pathways. The method can be readily applied to mine the functional associations between proteins from various biologically relevant sets.
Gene Ontology database
The hierarchical structure of Gene Ontology (GO) and GO term definitions are obtained from the Gene Ontology Consortium [46, 86] (http://archive.geneontology.org/), database version 2009-08. The Gene Ontology Annotation (GOA) database, version 2009-10, is used for the association between UniProt identifiers and GO terms (http://www.ebi.ac.uk/GOA/archive.html). Inferred Electronic Annotations (IEA) were excluded to increase the reliability of functional data. There are 46,686 protein - GO term association pairs for Saccharomyces cerevisiae (yeast) and 90,823 associations for Homo sapiens (human).
We used NCBI's Entrez ESearch utility to obtain the count of PubMed abstracts related to particular GO terms. For example, for computing the PubMed association between terms GO:0003700 and GO:0051169, we first obtain their respective term definitions as 'transcription factor activity' and 'nuclear transport' from the GO database and remove the words 'and', 'or', and 'not' from their definitions. The remaining words in the definition are used to construct a URL, e.g. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=xml&rettype=full&term=transcription+factor+activity, which yields an XML document that is then parsed to obtain the count of PubMed abstracts associated with the given term. To retrieve the count of abstracts mentioning two GO terms, we appended both terms in the query URL and obtained the count. The ESearch query interface uses MeSH indexing to incorporate synonyms and term variations. This provides a convenient way to retrieve information that has been represented using different terms for the same concepts. The January 2010 version of the PubMed database was used.
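The query construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the script used in the study: the helper names (`query_words`, `esearch_url`, `parse_count`) are hypothetical, a two-term query is assumed to be the words of both terms joined by spaces, and the count is read from the top-level `Count` element of the ESearch XML response. The network request itself is deliberately left out; the sample XML is a hand-written stand-in for a response.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode

# Base URL exactly as quoted in the text.
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
# Words removed from term definitions before querying, as stated in the text.
STOP_WORDS = {"and", "or", "not"}


def query_words(term_name: str) -> List[str]:
    """Split a GO term name into words and drop 'and', 'or', 'not'."""
    return [w for w in term_name.split() if w.lower() not in STOP_WORDS]


def esearch_url(*term_names: str) -> str:
    """Build an ESearch URL for one GO term, or for several terms together.

    Assumption: terms are combined by concatenating their words.
    """
    words = [w for name in term_names for w in query_words(name)]
    params = {
        "db": "pubmed",
        "retmode": "xml",
        "rettype": "full",
        "term": " ".join(words),  # urlencode turns spaces into '+'
    }
    return ESEARCH_URL + "?" + urlencode(params)


def parse_count(xml_text: str) -> int:
    """Read the number of matching abstracts from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


# Hand-written stand-in for a response; the count value is made up.
sample_xml = "<eSearchResult><Count>42</Count><RetMax>20</RetMax></eSearchResult>"

print(esearch_url("transcription factor activity"))
print(esearch_url("transcription factor activity", "nuclear transport"))
print(parse_count(sample_xml))
```

Single-term and two-term counts obtained this way are the raw inputs from which the PAS is derived.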
Biologically coherent sets of proteins
A coherent set of proteins consists of proteins that take part in the same biological context in a cell. For example, it can be a set of proteins playing roles in the same pathway, proteins involved in a disease, or proteins active at a certain stage of development. Here we have prepared three types of coherent sets of yeast proteins: proteins in the same KEGG pathways, proteins included in the same protein complexes, and proteins that have the same subcellular localization. In addition, two datasets of interacting protein pairs, from yeast and human, were prepared. Details are described below. All the datasets are available at http://kiharalab.org/functionSim/.
Yeast KEGG pathway dataset
We downloaded yeast pathways from the ftp site of the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. This dataset consists of proteins in 101 pathways. The pathway size (the number of proteins in a pathway) ranges from 2 to 123 proteins, with most of the pathways having around 20 proteins (Figure 5A). The UniProtKB/Swiss-Prot database (version 2009-03) was used to obtain the mapping between KEGG database identifiers and yeast SGD identifiers.
Yeast protein complex dataset
For the yeast protein complex dataset, we used the latest catalogue, YHTP2008, of 400 protein complexes compiled from genome-wide high-throughput studies by Pu et al. (http://wodaklab.org/cyc2008/downloads). The catalogue provides protein complexes with Saccharomyces Genome Database (SGD) identifiers, which are converted to UniProt identifiers to associate them with the corresponding GOA annotations. The set sizes are shown in Figure 5B. Most of the protein complexes have about five or fewer component proteins, with a few exceptions such as the ribosomal complex, whose size is 176.
Yeast GO cellular component (GOcc) datasets
We have constructed sets of yeast proteins with the same cellular component (CC) GO terms. Yeast proteins with non-IEA GO annotations in the CC domain are selected from the GOA database. Then, for each such yeast protein, the CC terms are enriched by transferring parental annotations according to the true path rule in the GO hierarchy, so that all ancestors of a GO term are incorporated as annotations for the protein. A total of 560 protein sets were obtained, with sizes ranging from 2 to 4814. Very large protein sets correspond to CC terms that are too general. Therefore, 481 sets with a size of up to 100 were selected for analysis (Figure 5C).
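The set construction above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy hierarchy, and the toy annotations are hypothetical, and the hierarchy is represented as a plain child-to-parents mapping rather than read from the GO database.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return all ancestors of a term in a child -> list-of-parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_annotations(annotations, parents):
    """Apply the true path rule: add every ancestor of each annotated term."""
    expanded = {}
    for protein, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[protein] = full
    return expanded


def sets_by_term(expanded, min_size=2, max_size=100):
    """Group proteins by term and keep sets within the given size range."""
    groups = defaultdict(set)
    for protein, terms in expanded.items():
        for term in terms:
            groups[term].add(protein)
    return {t: ps for t, ps in groups.items() if min_size <= len(ps) <= max_size}


# Toy hierarchy and annotations (placeholders, not real GOA data).
toy_parents = {"nucleolus": ["nucleus"], "nucleus": ["cell"], "cytosol": ["cell"]}
toy_annotations = {"P1": {"nucleolus"}, "P2": {"nucleus"}, "P3": {"cytosol"}}

toy_expanded = expand_annotations(toy_annotations, toy_parents)
toy_sets = sets_by_term(toy_expanded)
```

With this toy input, the very general term "cell" collects every protein, which mirrors why the largest sets were excluded by the size cutoff of 100.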
Protein-Protein Interaction (PPI) data
We used Saccharomyces cerevisiae (budding yeast) and Homo sapiens (human) interaction data available in the BioGRID database (version BIOGRID-2.0.56). From the BioGRID data, only physical interactions between proteins that have a UniProt identifier and at least one GO annotation are used. The interactions are binary, and thus no weight is associated with the edges in the PPI networks. For yeast and human, we have 72,053 and 33,099 interacting protein pairs, respectively. The number of proteins involved in the interactions is 4833 for yeast and 6241 for human.
In addition to the experimentally identified PPI networks, we generated random protein-protein interactions. These serve two purposes: to build the null distribution of functional similarity scores for interacting proteins, and to compute the ROC curve. For each of yeast and human, 100,000 random pairs were generated to form the null distribution. For the ROC curve computation, we generated the same number of random interactions (false positives) as there are actual interactions in each organism.
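One way to generate such random pairs is sketched below. The text does not say how the pairs were drawn, so the choices here are assumptions: pairs are unordered, a protein is never paired with itself, no pair is repeated, and known interactions are excluded. The function name and toy identifiers are hypothetical.

```python
import random


def random_pairs(proteins, n_pairs, known=(), seed=None):
    """Sample distinct unordered protein pairs that are not known interactions."""
    proteins = sorted(set(proteins))
    members = set(proteins)
    known = {frozenset(pair) for pair in known}
    # Count the known pairs that could actually be drawn from this protein set.
    blocked = sum(1 for pair in known if len(pair) == 2 and pair <= members)
    available = len(proteins) * (len(proteins) - 1) // 2 - blocked
    if n_pairs > available:
        raise ValueError("not enough distinct pairs to sample")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_pairs:
        pair = frozenset(rng.sample(proteins, 2))  # two different proteins
        if pair not in known:
            chosen.add(pair)
    return [tuple(sorted(pair)) for pair in sorted(chosen, key=sorted)]


# Toy example with placeholder identifiers.
toy_pairs = random_pairs(["A", "B", "C", "D"], 5, known=[("A", "B")], seed=0)
```

Rejection sampling as used here is adequate when the number of requested pairs is small compared with the number of possible pairs, which holds for 100,000 pairs drawn from several thousand proteins.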
Co-occurrence Association Score (CAS)
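Reading the definitions that follow, the CAS for a pair of GO terms i and j takes a form like the one below. This is a reconstruction from those definitions, assuming that each single-term count is normalized by the total count of single-term annotations:

```latex
\mathrm{CAS}(i,j) =
\frac{ C(i,j) \Big/ \sum_{k,l} C(k,l) }
     { \left( C(i) \Big/ \sum_{k} C(k) \right)
       \left( C(j) \Big/ \sum_{k} C(k) \right) }
\tag{1}
```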
Here C(i) is the number of sequences in the database which have GO term i. Similarly, C(i,j) is the number of sequences in the database which have the pair of GO terms i and j. Thus, the numerator quantifies the fraction of sequences with annotations i and j relative to the total number of GO term pairs annotating the same proteins. The denominator is the expected frequency with which the two GO terms, i and j, co-occur in single proteins. This formulation is essentially the same as the one used to compute a knowledge-based statistical amino acid contact potential [61, 62].
Among the GO terms annotating sequences in the GOA database, those with the evidence code Inferred from Electronic Annotation (IEA) are discarded. Along with the original annotations, the parental GO terms of the original annotations, obtained by following the true path rule, are also considered in computing the CAS. This procedure implicitly adds information from the GO hierarchy to the scoring scheme. GO term pairs which do not co-occur in any gene are assigned a CAS of zero.
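The computation can be illustrated with a short Python sketch. It assumes the CAS is the observed pair frequency divided by the product of the two single-term frequencies, as the definitions of the numerator and denominator suggest. The function name and toy annotations are hypothetical, annotations are taken to be already expanded with parental terms, and pairs of identical terms are left out because the text does not say how C(i,i) is treated.

```python
from collections import Counter
from itertools import combinations


def build_cas(annotations):
    """Return a function cas(i, j) computed from protein -> set-of-terms data."""
    single = Counter()  # C(i): number of proteins annotated with term i
    pair = Counter()    # C(i, j): number of proteins annotated with both i and j
    for terms in annotations.values():
        single.update(terms)
        for a, b in combinations(sorted(terms), 2):
            pair[(a, b)] += 1
    total_single = sum(single.values())
    total_pair = sum(pair.values())

    def cas(i, j):
        key = (i, j) if i <= j else (j, i)
        c_ij = pair.get(key, 0)
        if c_ij == 0:
            return 0.0  # pairs that never co-occur get a score of zero
        observed = c_ij / total_pair
        expected = (single[i] / total_single) * (single[j] / total_single)
        return observed / expected

    return cas


# Toy annotations with placeholder term names.
toy = {"P1": {"A", "B"}, "P2": {"A", "B"}, "P3": {"A", "C"}, "P4": {"B", "C"}}
cas = build_cas(toy)
```

For the toy data, C(A) = 3, C(B) = 3, C(C) = 2, and the pair counts are C(A,B) = 2, C(A,C) = 1, C(B,C) = 1, so cas("A", "B") is (2/4) / ((3/8)(3/8)), which is about 3.56.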
PubMed Association Score (PAS)
where Pub(i, j) is the number of PubMed abstracts which mention both GO terms i and j, and Pub(i) is the number of abstracts which mention GO term i. Because PubMed includes nearly 19 million references, it is computationally challenging to obtain the exact total number of abstracts for all the co-occurring pairs in the database, Σ_{k,l} Pub(k, l). Thus, for the second term, which can be considered as a scaling factor for PAS(i, j), the corresponding value computed for the CAS in Eqn. 1 is used.
Protein pair association measure
P_xi refers to the i-th annotation of protein P_x. Each annotation of P_x is compared with all GO terms of P_y, and the one which gives the maximum score is chosen. The best-matching scores for all P_xi are then averaged, i.e. summed and multiplied by 1/A_x (A_x being the number of annotations of P_x). The same procedure is performed for P_y, and the larger of the two values is taken as the CAS or the PAS association between the two proteins, P_x and P_y. This matrix-based comparison was proposed by Schlicker et al.
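The best-match averaging described above can be sketched as follows. The function names are hypothetical, go_score stands for any term-pair score such as the CAS or the PAS, and the toy score table is invented for the example.

```python
def directional_score(terms_x, terms_y, go_score):
    """Average, over the terms of P_x, of the best score against any term of P_y."""
    if not terms_x or not terms_y:
        return 0.0
    best = [max(go_score(a, b) for b in terms_y) for a in terms_x]
    return sum(best) / len(terms_x)


def protein_pair_score(terms_x, terms_y, go_score):
    """Take the larger of the two directional averages."""
    return max(
        directional_score(terms_x, terms_y, go_score),
        directional_score(terms_y, terms_x, go_score),
    )


# Toy symmetric score table with placeholder term names.
toy_scores = {("a", "c"): 2.0, ("a", "d"): 1.0, ("b", "d"): 4.0}


def toy_go_score(i, j):
    return toy_scores.get((i, j), toy_scores.get((j, i), 0.0))


x_terms = ["a", "b"]
y_terms = ["c", "d", "e"]
forward = directional_score(x_terms, y_terms, toy_go_score)
backward = directional_score(y_terms, x_terms, toy_go_score)
pair_score = protein_pair_score(x_terms, y_terms, toy_go_score)
```

In the toy case the P_x direction averages the best matches 2.0 and 4.0 to give 3.0, while the P_y direction averages 2.0, 4.0, and 0.0 to give 2.0, so the pair score is 3.0. Taking the larger value keeps an unmatched extra annotation on one protein from lowering the score.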
Protein set coherence score
Semantic Similarity based coherence score
We compare the CAS and the PAS coherence scores with three existing related scores: the semantic similarity score, a score designed by Chagoyen et al., and another one by Pandey et al. [57, 58]. The latter two scores are explained in the subsequent sections.
Each GOscore is squared, following the original funsim score proposed by Schlicker et al.
Chagoyen coherence score
Pandey coherence score
where c_l ≤ c_k indicates that c_k is an ancestor of c_l.
Statistical significance of coherence score of a protein set
MC is supported by a grant from Purdue Research Foundation and Showalter Trust. DK is supported by grants from National Institutes of Health (R01GM075004) and National Science Foundation (DMS800568, EF0850009, and IIS0915801).
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
- Pearson WR: Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 1990, 183: 63–98.
- Attwood TK, Bradley P, Flower DR, Gaulton A, Maudling N, Mitchell AL, Moulton G, Nordle A, Paine K, Taylor P, et al.: PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res 2003, 31: 400–402. 10.1093/nar/gkg030
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al.: Pfam: clans, web tools and services. Nucleic Acids Res 2006, 34: D247-D251. 10.1093/nar/gkj149
- Gaulton A, Attwood TK: Motif3D: Relating protein sequence motifs to 3D structure. Nucleic Acids Res 2003, 31: 3333–3336. 10.1093/nar/gkg534
- Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche BA, de CE, Lachaize C, Langendijk-Genevaux PS, Sigrist CJ: The 20 years of PROSITE. Nucleic Acids Res 2008, 36: D245-D249.
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, et al.: InterPro: the integrative protein signature database. Nucleic Acids Res 2009, 37: D211-D215. 10.1093/nar/gkn785
- Chikhi R, Sael L, Kihara D: Real-time ligand binding pocket database search using local surface descriptors. Proteins 2010, 78: 2007–2028. 10.1002/prot.22715
- La D, Esquivel-Rodriguez J, Venkatraman V, Li B, Sael L, Ueng S, Ahrendt S, Kihara D: 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics 2009, 25: 2843. 10.1093/bioinformatics/btp542
- Sael L, Kihara D: Binding Ligand Prediction for Proteins Using Partial Matching of Local Surface Patches. International Journal of Molecular Sciences 2010, 11: 5009–5026. 10.3390/ijms11125009
- Brun C, Chevenet F, Martin D, Wojcik J, Guenoche A, Jacq B: Functional classification of proteins for the prediction of cellular function from a protein-protein interaction network. Genome Biol 2003, 5: R6. 10.1186/gb-2003-5-1-r6
- Hawkins T, Chitale M, Kihara D: Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP. BMC Bioinformatics 2010, 11: 265. 10.1186/1471-2105-11-265
- Chitale M, Hawkins T, Kihara D: Automated prediction of protein function from sequence. In Prediction of protein structure, functions, and interactions. Edited by: Bujnicki J. Wiley Online Library; 2009:63–86.
- Chitale M, Kihara D: Computational protein function prediction: Framework and challenges. In Protein Function Prediction for Omics Era. Volume Chapter 1. Edited by: Kihara D. Springer Verlag; 2011:1–17.
- Chitale M, Kihara D: Enhanced Sequence-Based Function Prediction Methods and Application to Functional Similarity Networks. In Protein Function Prediction for Omics Era. Volume Chapter 2. Edited by: Kihara D. Springer Verlag; 2011:19–34.
- Eisenberg D, Marcotte EM, Xenarios I, Yeates TO: Protein function in the post-genomic era. Nature 2000, 405: 823–826. 10.1038/35015694
- Friedberg I: Automated protein function prediction--the genomic challenge. Brief Bioinform 2006, 7: 225–242. 10.1093/bib/bbl004
- Valencia A: Automatic annotation of protein function. Curr Opin Struct Biol 2005, 15: 267–274. 10.1016/j.sbi.2005.05.010
- Bork P, Koonin EV: Predicting functions from protein sequences--where are the bottlenecks? Nat Genet 1998, 18: 313–318. 10.1038/ng0498-313
- Devos D, Valencia A: Practical limits of function prediction. Proteins 2000, 41: 98–107. 10.1002/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S
- Chitale M, Hawkins T, Park C, Kihara D: ESG: extended similarity group method for automated protein function prediction. Bioinformatics 2009, 25: 1739–1745. 10.1093/bioinformatics/btp309
- Hawkins T, Luban S, Kihara D: Enhanced automated function prediction using distantly related sequences and contextual association by PFP. Protein Sci 2006, 15: 1550–1556. 10.1110/ps.062153506
- Hawkins T, Chitale M, Luban S, Kihara D: PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data. Proteins 2009, 74: 566–582. 10.1002/prot.22172
- Martin DM, Berriman M, Barton GJ: GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics 2004, 5: 178. 10.1186/1471-2105-5-178
- Vinayagam A, del VC, Schubert F, Eils R, Glatting KH, Suhai S, Konig R: GOPET: a tool for automated predictions of Gene Ontology terms. BMC Bioinformatics 2006, 7: 161. 10.1186/1471-2105-7-161
- Wass MN, Sternberg MJ: ConFunc--functional annotation in the twilight zone. Bioinformatics 2008, 24: 798–806. 10.1093/bioinformatics/btn037
- Zehetner G: OntoBlast function: From sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res 2003, 31: 3799–3803. 10.1093/nar/gkg555
- Chua HN, Sung WK, Wong L: Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 2006, 22: 1623–1630. 10.1093/bioinformatics/btl145
- Gao L, Li X, Guo Z, Zhu M, Li Y, Rao S: Widely predicting specific protein functions based on protein-protein interaction data and gene expression profile. Sci China C Life Sci 2007, 50: 125–134. 10.1007/s11427-007-0009-1
- Hishigaki H, Nakai K, Ono T, Tanigami A, Takagi T: Assessment of prediction accuracy of protein function from protein--protein interaction data. Yeast 2001, 18: 523–531. 10.1002/yea.706
- Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 2003, 19(Suppl 1):i197-i204. 10.1093/bioinformatics/btg1026
- Markowetz F, Troyanskaya OG: Computational identification of cellular networks and pathways. Mol Biosyst 2007, 3: 478–482. 10.1039/b617014p
- Nariai N, Kolaczyk ED, Kasif S: Probabilistic protein function prediction from heterogeneous genome-wide data. PLoS One 2007, 2: e337. 10.1371/journal.pone.0000337
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol 2000, 18: 1257–1261. 10.1038/82360
- Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol 2007, 3: 88.
- Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, et al.: The IntAct molecular interaction database in 2010. Nucleic Acids Res 2010, 38: D525-D531. 10.1093/nar/gkp878
- Breitkreutz BJ, Stark C, Reguly T, Boucher L, Breitkreutz A, Livstone M, Oughtred R, Lackner DH, Bahler J, Wood V, et al.: The BioGRID Interaction Database: 2008 update. Nucleic Acids Res 2008, 36: D637-D640.
- Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, et al.: STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 2009, 37: D412-D416. 10.1093/nar/gkn760
- Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stumpflen V, Mewes HW, et al.: The MIPS mammalian protein-protein interaction database. Bioinformatics 2005, 21: 832–834. 10.1093/bioinformatics/bti115
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32: D449-D451. 10.1093/nar/gkh086
- Hubble J, Demeter J, Jin H, Mao M, Nitzberg M, Reddy TB, Wymore F, Zachariah ZK, Sherlock G, Ball CA: Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 2009, 37: D898-D901. 10.1093/nar/gkn786
- Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, et al.: ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 2010.
- Ahrens CH, Brunner E, Qeli E, Basler K, Aebersold R: Generating and navigating proteome maps using mass spectrometry. Nature Reviews Molecular Cell Biology 2010, 11: 789–801. 10.1038/nrm2973
- Van Vliet AHM: Next generation sequencing of microbial transcriptomes: challenges and opportunities. FEMS microbiology letters 2010, 302: 1–7. 10.1111/j.1574-6968.2009.01767.x
- Nagalakshmi U, Waern K, Snyder M: RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol 2010, 89: 1–13.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
- Sheehan B, Quigley A, Gaudin B, Dobson S: A relation based measure of semantic similarity for Gene Ontology annotations. BMC Bioinformatics 2008, 9: 468. 10.1186/1471-2105-9-468
- Lee JH, Kim MH, Lee YJ: Information retrieval based on conceptual distance in IS-A hierarchies. Journal of Documentation 1993, 49: 188–207. 10.1108/eb026913
- Resnik P: Using information content to evaluate semantic similarity in a taxonomy. The proceedings of 14th International Joint Conference on Artificial Intelligence 1995, 448–453.
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19: 1275–1283. 10.1093/bioinformatics/btg153
- Lin D: An information-theoretic definition of similarity. The proceedings of the 15th International Conference on Machine Learning 1998, 296–304.
- Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006, 7: 302. 10.1186/1471-2105-7-302
- Curtis RK, Oresic M, Vidal-Puig A: Pathways to the analysis of microarray data. Trends Biotechnol 2005, 23: 429–435. 10.1016/j.tibtech.2005.05.011
- Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81: 98–104. 10.1016/S0888-7543(02)00021-6
- Zheng B, Lu X: Novel metrics for evaluating the functional coherence of protein groups via protein semantic network. Genome Biol 2007, 8: R153. 10.1186/gb-2007-8-7-r153
- Chagoyen M, Carazo JM, Pascual-Montano A: Assessment of protein set coherence using functional annotations. BMC Bioinformatics 2008, 9: 444. 10.1186/1471-2105-9-444
- Pandey J, Koyuturk M, Subramaniam S, Grama A: Functional coherence in domain interaction networks. Bioinformatics 2008, 24: i28-i34. 10.1093/bioinformatics/btn296
- Pandey J, Koyuturk M, Grama A: Functional characterization and topological modularity of molecular interaction networks. BMC Bioinformatics 2010, 11(Suppl 1):S35. 10.1186/1471-2105-11-S1-S35
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, 35: D5–12. 10.1093/nar/gkl1031
- Barrell D, Dimmer E, Huntley RP, Binns D, O'Donovan C, Apweiler R: The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Res 2009, 37: D396-D403. 10.1093/nar/gkn803
- Skolnick J, Jaroszewski L, Kolinski A, Godzik A: Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci 1997, 6: 676–688.
- Yang YD, Park C, Kihara D: Threading without optimizing weighting factors for scoring function. Proteins 2008, 73: 581–596. 10.1002/prot.22082
- The Gene Ontology in 2010: extensions and refinements Nucleic Acids Res 2010, 38: D331-D335.
- Inohara N, Iwamoto A, Moriyama Y, Shimomura S, Maeda M, Futai M: Two genes, atpC1 and atpC2, for the gamma subunit of Arabidopsis thaliana chloroplast ATP synthase. Journal of Biological Chemistry 1991, 266: 7333.
- Del Campo M, Lambowitz AM: Structure of the Yeast DEAD box protein Mss116p reveals two wedges that crimp RNA. Molecular cell 2009, 35: 598–609. 10.1016/j.molcel.2009.07.032
- Klucken J, Büchler C, Orsó E, Kaminski WE, Porsch-Özcürümez M, Liebisch G, Kapinsky M, Diederich W, Drobnik W, Dean M: ABCG1 (ABC8), the human homolog of the Drosophila white gene, is a regulator of macrophage cholesterol and phospholipid transport. Proc Natl Acad Sci USA 2000, 97: 817–822. 10.1073/pnas.97.2.817
- Schumann RR, Leong SR, Flaggs GW, Gray PW, Wright SD, Mathison JC, Tobias PS, Ulevitch RJ: Structure and function of lipopolysaccharide binding protein. Science 1990, 249: 1429–1431. 10.1126/science.2402637
- Wilde CG, Seilhamer JJ, McGrogan M, Ashton N, Snable JL, Lane JC, Leong SR, Thornton MB, Miller KL, Scott RW: Bactericidal/permeability-increasing protein and lipopolysaccharide (LPS)-binding protein. LPS binding properties and effects on LPS-mediated cell activation. Journal of Biological Chemistry 1994, 269: 17411–17416.
- Houseley J, Tollervey D: The nuclear RNA surveillance machinery: The link between ncRNAs and genome structure in budding yeast? Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms 2008, 1779: 239–246. 10.1016/j.bbagrm.2007.12.008
- Wang B, Hurov K, Hofmann K, Elledge SJ: NBA1, a new player in the Brca1 A complex, is required for DNA damage resistance and checkpoint control. Genes & development 2009, 23: 729–739. 10.1101/gad.1770309
- Wadsworth P: Regional regulation of microtubule dynamics in polarized, motile cells. Cell motility and the cytoskeleton 1999, 42: 48–59. 10.1002/(SICI)1097-0169(1999)42:1<48::AID-CM5>3.0.CO;2-8
- Diaz-Martinez LA, Gimenez-Abian JF, Clarke DJ: Chromosome cohesion-rings, knots, orcs and fellowship. Journal of cell science 2008, 121: 2107–2114. 10.1242/jcs.029132
- Murante RS, Rust L, Bambara RA: Calf 5' to 3' exo/endonuclease must slide from a 5' end of the substrate to perform structure-specific cleavage. Journal of Biological Chemistry 1995, 270: 30377–30383. 10.1074/jbc.270.51.30377
- Binns D, Dimmer E, Huntley R, Barrell D, O'Donovan C, Apweiler R: QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics 2009, 25: 3045–3046. 10.1093/bioinformatics/btp536
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27
- Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res 2009, 37: 825–831. 10.1093/nar/gkn1005
- Agrawal R, Imieliński T, Swami A: Mining association rules between sets of items in large databases. ACM SIGMOD Record 1993, 22: 207–216. 10.1145/170036.170072
- Brijs T, Goethals B, Swinnen G, Vanhoof K, Wets G: A data mining framework for optimal product selection in retail supermarket data: the generalized PROFSET model. 300–304.
- Lawrence RD, Almasi GS, Kotlyar V, Viveros MS, Duri SS: Personalization of supermarket product recommendations. Data Mining and Knowledge Discovery 2001, 5: 11–32. 10.1023/A:1009835726774
- Smyth P, Goodman RM: An information theoretic approach to rule induction from databases. Knowledge and Data Engineering, IEEE Transactions on 2002, 4: 301–316.
- Quinlan JR: Learning logical definitions from relations. Machine learning 1990, 5: 239–266.
- Koike A, Niwa Y, Takagi T: Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 2005, 21: 1227–1236. 10.1093/bioinformatics/bti084
- Krallinger M, Padron M, Valencia A: A sentence sliding window approach to extract protein annotations from biomedical articles. BMC bioinformatics 2005, 6: S19.
- King OD, Foulger RE, Dwight SS, White JV, Roth FP: Predicting gene function from patterns of annotation. Genome Res 2003, 13: 896–904. 10.1101/gr.440803
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32: D258-D261. 10.1093/nar/gkh036
- The Universal Protein Resource (UniProt) 2009 Nucleic Acids Res 2009, 37: D169-D174.
- Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, et al.: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 2002, 30: 69–72. 10.1093/nar/30.1.69
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.