A transversal approach to predict gene product networks from ontology-based similarity
© Chabalier et al; licensee BioMed Central Ltd. 2007
Received: 10 November 2006
Accepted: 02 July 2007
Published: 02 July 2007
Interpretation of transcriptomic data is usually made through a "standard" approach, which consists in clustering the genes according to their expression patterns and exploiting Gene Ontology (GO) annotations within each expression cluster. This approach makes it difficult to highlight functional relationships between gene products that belong to different expression clusters. To address this issue, we propose a transversal analysis that aims to predict functional networks based on a combination of GO processes and expression data.
The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Through a weighting scheme over the annotations, we take into account the representativity of the terms that annotate a gene product. Comparing annotation vectors results in a matrix of gene product similarities. Combined with expression data, the matrix is displayed as a set of functional gene networks. The transversal approach was applied to 186 genes related to the enterocyte differentiation stages. This approach resulted in 18 functional networks that proved to be biologically relevant. These results were compared with those obtained through a standard approach and with an approach based on information content similarity.
Complementary to the standard approach, the transversal approach offers new insight into the cellular mechanisms and reveals new research hypotheses by combining gene product networks based on semantic similarity with expression data.
Interpretation of data resulting from high-throughput analyses is a challenge in bioinformatics. Two major information sources are usually used to make this interpretation: expression data and biological annotations, mainly based on the Gene Ontology™ (GO). According to Eisen et al., expression data organize genes into functional categories: genes that are expressed together share common functions. Therefore, the interpretation of microarray experimental data is usually performed through the following "standard" approach: 1) the genes are organized into clusters depending on their differential expression pattern and 2) for each cluster, the list of genes is translated into a functional profile able to offer insight into the cellular mechanisms relevant in the given condition. Several tools have been proposed for ontological analysis of gene expression data (for review see ). Among them, following the standard approach used to interpret expression data, Gibbons and Roth proposed to judge the quality of expression-based clustering methods using GO terms. However, as argued in [6, 7], complex biological functions emerge from interactions between gene products. Integrated systems, defined as the assembly of individual gene products into such complexes, can collaborate in broader biological processes. For example, in Bacillus subtilis, an ABC transporter and a two-component regulatory system, respectively involved in transport and signal transduction, collaborate in the same biological process: antibiotic resistance. Therefore, if different functions can be involved in a common biological process, we can assume that the genes involved in such a process may show different expression patterns. Consequently, the standard approach makes it difficult to highlight functional relationships between gene products when they belong to different expression clusters.
Complementary to the standard approach, we define a transversal analysis that aims to predict functional networks of gene products based on the biological processes they belong to. Simultaneously, genes involved in such networks are clustered according to their expression patterns. The combined visualization of functional networks and expression clusters is expected to offer new insight into the roles of the gene products. We propose to use ontology-based similarity to predict functional gene product networks. Based on GO term similarity, the semantic similarity between gene products consists in the comparison of the different terms assigned to a pair of gene products. Typically, two approaches can be used to compute term similarity within hierarchies. The path-based method relies on the edge-counting approach defined in . The shorter the path from one node to the other, the more similar they are. However, the semantic distances between any two adjacent nodes are not necessarily equal. Indeed, the distance shrinks as one descends the hierarchy, since differentiation is based on finer and finer details. The information content method is based on the Lin, Jiang and Resnik measures [10–12]. This approach relies on the frequency of a concept in a large corpus. Based on this approach, ongoing works propose to establish functional relationships between gene products [13–16]. As discussed in , the information content approach tends to give better results for term similarities than the path-based method. However, applied to gene similarity, it does not always meaningfully estimate similarity between genes because it does not take into account the hierarchy organizing terms (e.g. ).
The transversal approach presented in this paper consists in computing the semantic similarity between gene products in a Vector Space Model. Gene products are described as vectors of GO terms. The major contribution of this approach is the possibility of using a weighting scheme over the annotations. The comparison of such annotation vectors results in a matrix of gene similarity. Combined with expression data, the matrix is displayed as a set of functional gene networks. Each gene-gene relation is associated with the shared annotations. Hierarchy issues are addressed by 1) an a priori selection of terms according to a pre-determined level of abstraction and 2) an a posteriori refinement of data interpretation to focus on a particular biological process. The transversal analysis was applied to a set of differentially expressed genes related to enterocyte differentiation. These genes were previously studied by a standard approach.
This paper is organized as follows. First, biological results and their comparison with the KEGG pathways are presented and discussed, then the transversal analysis methodology is detailed.
Overview of the transversal approach
starting with a collection of gene products that have been clustered according to their expression with an expression clustering-based method;
selection of GO terms associated with each gene product according to an a priori level of abstraction (apLev) (Figure 1a);
construction of a weighted term vector for each gene product (Figure 1b);
pairwise comparison of these vectors in a Vector Space Model. This comparison results in a half-matrix of gene product similarities (Figure 1c);
selection of a similarity threshold to obtain the pairs of gene products that have a high degree of similarity;
displaying the resulting pairs of gene products associated with their corresponding expression clusters (Figure 1d). A gene product pair is displayed as two nodes linked by an edge. It results in a set of "transversal networks". The most frequent terms are used to describe each network as a biological profile (Figure 1e).
At this step, the resulting networks are biologically interpreted. This analysis can be refined by performing several runs at various levels of abstraction (named a posteriori levels). Gene products that are associated with finer-grained GO terms are then grouped together under more general categories.
A detailed description of the methodology is provided in the Methods section.
The transversal analysis was applied to a set of genes related to enterocyte differentiation. These genes were previously studied by a standard approach. In this paper, we refer to this set of genes as the Bedrine-Ferran gene set (BF set). As CaCo-2 cells spontaneously differentiate into enterocytes, this cell line was used to characterize genes whose expression varies during differentiation by means of microarray experiments. The authors performed a clustering with Self-Organizing Maps (see Methods section), and the resulting expression clusters are combined with the transversal networks in our approach. These experiments led to the identification of 186 significant genes through the in vitro differentiation process: 50 were down-regulated, 80 up-regulated and 56 were "invariant", i.e. their expression remained constant during the differentiation stages. We have applied the transversal analysis to the BF set. 187 distinct Biological Process terms related to 119 gene products were extracted (the 67 remaining gene products were not associated with any GO Biological Process term). As these terms are located at various hierarchy depths, we compared the different levels of abstraction in order to compute, at the most appropriate level, the semantic similarity between the gene products.
a priori level of abstraction
Computing semantic similarity
Biological profiles and gene expression related to the networks of more than two gene products
Number of genes
cellular macromolecule metabolism; protein metabolism; macromolecule biosynthesis; cellular biosynthesis
cellular lipid metabolism; lipid metabolism; cellular catabolism
organelle organization and biogenesis; DNA metabolism
amine metabolism; amino acid and derivative metabolism
cellular macromolecule metabolism; protein metabolism
cellular catabolism; generation of precursor metabolites and energy; carbohydrate metabolism
intracellular transport; protein transport; establishment of protein localization
Network 1 is depicted in Figure 4a. Its biological profile is protein metabolism/cellular biosynthesis. This network corresponds to a clique and, as described in the Methods section, this topology can be associated with a robust biological network. Three translation initiation factors (EIF4A2, EIF3S2, and EIF3S8), seven ribosomal proteins (RPS7, RPL13A, RPL41, RPL35A, RPL39, RPS3, and RPL7A) and two gene products involved in protein glycosylation (ALG8 and MAN2A1) are involved in this network. With three stages of the protein biosynthesis process – translation initiation, translation and post-translational modification – this network is functionally homogeneous. One might expect, in cellular proliferation, an overexpression of the genes involved in protein biosynthesis. While this is observed for the genes involved in translation initiation, the genes encoding ribosomal proteins are either invariant or down-regulated. This down-regulation pattern might be related to additional functions (such as transcription, RNA processing, DNA repair, inflammation) as argued in [22, 23]. Similarly to translation, initiation of protein glycosylation in the endoplasmic reticulum might be activated in cell proliferation (down-regulation of ALG8), whereas the later glycosylation steps occurring in the Golgi apparatus might be invariant throughout cell differentiation (invariant expression of MAN2A1).
Network 2 is related to the amine metabolism biological profile (Figure 4b). Eight gene products involved in arginine metabolism (GLS, ASS, CPS1 and GLUL), creatine biosynthesis (GATM), polyamine biosynthesis (ODC1 and SMS), and selenoamino acid metabolism (SEPHS2) are associated in this network. This network is functionally homogeneous. Its heterogeneous expression profile suggests that a specific biochemical pathway, leading to the creatine precursor (guanidinoacetate), is activated during differentiation stages (up-regulation of GLS, GATM, ASS and CPS1), while polyamine biosynthesis is repressed (down-regulation of ODC1 and SMS). For instance, it has been shown that only the arginine metabolism is performed in the small intestine. Through this network, a potential NH3-detoxification role could be attributed to the enterocyte. The invariant expression of GLUL is explained by its role (glutamine synthetase), which is opposite to the role of GLS (glutaminase) in amine metabolism. Moreover, the GLUL expression is in accordance with the need for glutamate for complete CaCo-2 cell differentiation.
Network 3 is divided into two subnetworks connected by a specific gene product (MBTPS1) involved in both (Figure 4c). The first subnetwork, lipid metabolism, is functionally homogeneous, with nine gene products characterizing the CaCo-2 cells, which are a model for intestinal lipoprotein synthesis and secretion. Four apolipoproteins involved in lipid transport (APOA1, APOB, APOC and APOM) and five gene products involved in cholesterol metabolism (MBTPS1, HMGCS1, ACAS2), androgen metabolism (UGT2B17) and arachidonic acid metabolism (AKR1C3) are gathered in this subnetwork. Seven genes belonging to lipid metabolism appear to be up-regulated, due to the role of differentiated enterocytes, which increase lipid uptake, metabolism and packaging. Conversely, HMGCS1, a key enzyme of cholesterol synthesis, is down-regulated during the differentiation stage. As argued in , this enzyme is transcriptionally repressed by an increase of cholesterol in the cell. MBTPS1, through a specific degradation role, is also involved in cholesterol biosynthesis. The second subnetwork is related to cellular catabolism, with three gene products involved in ubiquitin conjugation (UBE2D1), digestion (MEP1A), and apoptosis (RNF128). MBTPS1, with its degradation role in cholesterol metabolism, is involved in the two subnetworks and represents the only connection between them. All the gene products share a degradation function. Therefore, this subnetwork could be considered as functionally homogeneous according to the catabolism profile. The heterogeneity in gene expression is due to the wide coverage of catabolism.
a posteriori level of abstraction
In order to evaluate the resulting transversal networks, we compared them with the KEGG PATHWAY database which is the reference database as regards the biochemical pathways (including most of the known metabolic pathways and some of the regulatory pathways). KEGG pathways are structured according to a three-level hierarchy including six major root classes: Metabolism, Genetic Information Processing, Environmental Information Processing, Cellular Processes, Human Diseases and Drug Development. The third level of this hierarchy corresponds to the KEGG pathways. A relation between a third level term and a gene product in the KEGG pathway database is considered as a KEGG annotation. The second level terms correspond to broader biological pathways.
A KEGG local relational database was built from the hierarchy available from the KEGG website. The database was used for querying and evaluating the transversal networks. The comparison was manually done.
Each transversal network was compared with the KEGG pathways. Among the 79 gene products resulting from the transversal analysis, 43 are annotated in KEGG. These gene products are associated with 49 pathways. Among the 18 transversal networks, 10 can be evaluated, i.e. they are composed of at least two gene products present in KEGG. Six transversal networks were consistent with KEGG: in four cases, KEGG annotations are identical or correspond to sibling KEGG pathways (i.e. pathways subsumed by the same second-level term); for one network, KEGG annotations correspond to closely related second-level terms (Amino acid metabolism/Metabolism of other amino acids); in one case, KEGG annotations are different but reflect the division of the network into subnetworks (in network 3, gene products are annotated with Lipid metabolism and Folding, Sorting and Degradation).
The four remaining networks are heterogeneous. However, from a biological point of view, the KEGG annotations are complementary. For example, network 1 is associated with Translation, Transcription and Glycan Biosynthesis and Metabolism. While Translation and Transcription derive from Genetic Information Processing, the last one belongs to Metabolism, although this pathway is related to post-translational modifications.
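The comparison with KEGG was done manually in this study, but the two criteria stated above (a network is evaluable when at least two of its gene products are present in KEGG, and consistent when their pathways fall under the same second-level term) can be sketched in a few lines. The gene names, pathway names and class names below are hypothetical placeholders, and the function names are illustrative only; this is not the procedure used by the authors.

```python
def is_evaluable(network, gene_pathways):
    """True when at least two gene products of the network have KEGG pathways."""
    return sum(1 for gene in network if gene_pathways.get(gene)) >= 2


def second_level_terms(network, gene_pathways, pathway_class):
    """Collect the second-level KEGG terms covering the network's pathways."""
    return {
        pathway_class[pathway]
        for gene in network
        for pathway in gene_pathways.get(gene, ())
    }


def is_sibling_consistent(network, gene_pathways, pathway_class):
    """True when the network is evaluable and all pathways share one second-level term."""
    if not is_evaluable(network, gene_pathways):
        return False
    return len(second_level_terms(network, gene_pathways, pathway_class)) == 1


# Hypothetical toy data: which pathways annotate each gene product,
# and which second-level term subsumes each pathway.
gene_pathways = {"G1": {"pathway_a"}, "G2": {"pathway_b"}, "G3": {"pathway_c"}}
pathway_class = {"pathway_a": "class_X", "pathway_b": "class_X", "pathway_c": "class_Y"}

print(is_sibling_consistent({"G1", "G2"}, gene_pathways, pathway_class))
print(is_sibling_consistent({"G1", "G3"}, gene_pathways, pathway_class))
```

A network that fails the second check is not necessarily wrong: as noted for network 1, annotations from different KEGG classes can still be biologically complementary, which is why the final judgment was left to manual inspection.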
KEGG results for the three networks
Transversal analysis network
Protein metabolism; Cellular biosynthesis
Glycan Biosynthesis and metabolism
Amino Acid Metabolism; Metabolism of Other Amino Acids
Lipid metabolism; Cellular catabolism
Folding, Sorting and Degradation
This paper presents a new approach to microarray data interpretation which combines gene product interaction networks with expression data and offers new insight into the cellular mechanisms. The biological networks obtained by this approach rely on ontology-based similarity, which is computed through the VSM. We first presented a preliminary study of this approach in  and we have enriched it with the definition of a conceptual distance metric. This improvement of the path approach is used to select the GO Biological Process terms in each annotation vector. The transversal analysis was applied to a collection of genes related to enterocyte differentiation. 18 functional networks involving 79 gene products were obtained. The 26 unrepresented gene products present a similarity degree that is under the selected threshold. To measure the significance of the gene product networks resulting from the transversal analysis, recall and precision were calculated by comparing the gene product classification resulting from our method against a gold standard: the KEGG pathway database. We estimated the precision to be 81.8% and the recall to be 83.7%. Moreover, the "false positives" were judged biologically relevant by the experts, indicating that our method provides biologists with valuable information. The "false negatives" were due to the incompleteness of the Gene Ontology annotation. Furthermore, the experts considered the resulting networks to be functionally homogeneous. The gene expression differs within each network (Table 1), highlighting that some specific processes are activated under some conditions. For example, network 2 suggests a new potential pathway related to amine metabolism that could be activated during the differentiation stage. This result emphasizes the contribution of the transversal analysis in suggesting new research hypotheses.
Furthermore, while transversal network 1 (protein biosynthesis) presents an expression heterogeneity which is biologically relevant, the standard approach did not highlight this expression fluctuation during the cellular differentiation process (see additional file 3: Standard approach comparison). The Graphviz software is used to visualize the resulting networks. Coupling this software with web technology assists the biological interpretation of transversal networks by associating each node (gene product) with its GO annotations and with the GO terms that are shared by all the gene products (networks are presented on a web page where the annotations are reachable by clicking on each node). By using different values for the GO interval and the threshold, the visualization of the networks is kept readable even if the number of studied gene products increases. While our method aims to retrieve all the networks related to dedicated microarrays, different strategies can be adopted in the case of pangenomic microarrays. For example, the method can be used to retrieve specific networks associated with fine-grained terms (by increasing the Lev parameter). Moreover, a first run can select the networks of interest, which can be refined by a second run.
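As a reminder of how the two evaluation measures are defined, the short sketch below computes precision and recall from counts of true positives, false positives and false negatives. The text does not report these counts; the values 36, 8 and 7 used here are hypothetical, chosen only because they reproduce the reported percentages, and should not be read as the authors' data.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw counts, guarding against empty denominators."""
    predicted = true_pos + false_pos   # everything the method placed in a network
    expected = true_pos + false_neg    # everything the gold standard contains
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    return precision, recall


# Hypothetical counts (assumption, not taken from the paper).
precision, recall = precision_recall(true_pos=36, false_pos=8, false_neg=7)
print(f"precision = {100 * precision:.1f}%, recall = {100 * recall:.1f}%")
```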
Major results with the Azuaje method. The Azuaje approach results in 52 gene products grouped into 12 networks of two to 14 gene products. The first Azuaje network merges two transversal networks. The gene products belonging to transversal network 3 are emphasized in bold face.
Network 1 + Network 3
Protein metabolism; Cellular Biosynthesis; Lipid metabolism
Partial Network 2 (50% of gene products missing)
To support the predictive role of the transversal approach, we take into account all the GO terms that are associated with gene products in annotation databases, whatever the evidence codes. Indeed, taking into account only non-IEA (not inferred from electronic annotation) codes could result in using only 40% of the annotations provided by the Gene Ontology Annotation database (GOA; ). Therefore, we have chosen to favor a high number of biological assumptions rather than the reliability of annotations. In the same way, the GO interval addresses the issue of the varying depth of GO branches by selecting annotations independently of the tree depth.
The a posteriori level of abstraction section shows the importance of selecting the granularity at which biological processes are studied. To our knowledge, this granularity is not taken into account in other approaches based on GO semantic similarity. The way GO terms are restricted in our approach is complementary to using a GO slim to perform the analysis. Indeed, while GO slims are mainly domain-dependent, the restriction that is performed in our approach depends on the set of terms that annotate the gene products and their level in the GO hierarchy.
The GO annotation is remarkably useful for the mining of functional and biological significance from large datasets, such as microarray results. However, the transversal analysis results reflect some gaps in GO annotation databases. Indeed, network 1 (protein metabolism/cellular biosynthesis) includes some ribosomal proteins. While some publications confirm that these gene products can be involved in replication, DNA repair or inflammation processes, there is no relation between these processes and the ribosomal proteins in GOA. The type of process itself can also cause difficulty in network interpretation. Some transversal high-level processes (e.g. catabolism in network 3) gather gene products that are involved in several processes (e.g. apoptosis or digestion for this network).
The KEGG hierarchy is used to classify the biochemical pathways. However, the lack of relation between the KEGG root classes can introduce a bias. Indeed, translation and post-translation modification correspond to independent classes in the KEGG hierarchy (Genetic Information Processing and Metabolism). With regards to these results, we consider using the GO relations found in the resulting networks to enrich our KEGG local database. While 79 gene products are represented in the 18 transversal analysis networks, 43 are present in 49 KEGG pathways. Whereas the relative high number of pathways is partly due to a finer granularity of their description, some of them are not present in the GO (such as the Human Diseases class related pathways). Therefore, we consider taking into account the KEGG data in the transversal approach. In addition, we will have to evaluate the relative contribution of GO and KEGG vocabularies in order to weight the terms during the VSM step of the transversal analysis. By adding KEGG terms, we expect to improve the definition of the transversal networks.
Currently, we are working on an improvement of the transversal approach by weighting the levels according to the local density of each node as suggested in . Future work will consist in comparing and possibly merging our approach with literature networks as described in . Furthermore, we plan to consider a measure of functional diversity, as the functional entropy discussed in , in order to support the evaluation of the transversal networks.
This paper presents a new approach to microarray result interpretation which aims to combine gene product interaction networks with expression data. The resulting transversal networks proved to be biologically consistent and offer new insight into the cellular mechanisms. Furthermore, the comparison with a standard approach corroborates the contribution of the transversal approach and underlines the complementarity of these two approaches.
Two major points reflect the novelty of the methodology developed to construct the transversal networks: the selection of an annotation level and the use of a weighting scheme over the annotations prior to computing the semantic similarity. The former avoids the artefact due to the arbitrary fluctuation of the GO depth and in addition takes into account the granularity of the studied biological processes. The latter considers the representativity of the annotations associated with the set of studied gene products. The comparison with the gene product clustering derived from an approach based on information content highlights the contribution of the transversal approach to the construction of gene product networks. Finally, the comparison with a differently structured biological vocabulary (i.e. the KEGG hierarchy) indicated that using a weighting scheme over the GO annotations and computing the similarity between gene products in a VSM constitute an efficient means to construct gene product networks and consequently to interpret microarray results.
The GO terms are organized according to three independent hierarchies: Biological Process, Molecular Function, and Cellular Component. The transversal analysis uses the Gene Ontology Annotation file (GOA; ) to provide assignments of GO terms to gene products. It is restricted to terms from the Biological Process hierarchy in order to retrieve genes functionally related to a given biological process.
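The annotation step can be illustrated with a minimal sketch that reads records in the tab-separated GO Annotation File (GAF) layout, keeps only the Biological Process aspect, and optionally drops IEA-coded annotations. It assumes the usual GAF column order (gene symbol in column 3, GO ID in column 5, evidence code in column 7, aspect in column 9); the sample records, gene symbols and GO IDs are made-up placeholders, and the function name is illustrative.

```python
from collections import defaultdict

# 0-based GAF column indices (assumed standard layout).
SYMBOL, GO_ID, EVIDENCE, ASPECT = 2, 4, 6, 8


def biological_process_terms(gaf_lines, exclude_iea=False):
    """Map each gene symbol to its set of Biological Process GO IDs."""
    annotations = defaultdict(set)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= ASPECT or fields[ASPECT] != "P":
            continue  # keep the Biological Process aspect only
        if exclude_iea and fields[EVIDENCE] == "IEA":
            continue  # optionally discard electronically inferred annotations
        annotations[fields[SYMBOL]].add(fields[GO_ID])
    return dict(annotations)


def record(symbol, go_id, evidence, aspect):
    """Build a minimal GAF-like line; unused columns are left empty."""
    return "\t".join(["DB", "ID", symbol, "", go_id, "REF", evidence, "", aspect])


# Placeholder records for illustration only.
sample = [
    "!gaf-version: 2.0",
    record("GENE_A", "GO:0000001", "IDA", "P"),
    record("GENE_A", "GO:0000002", "IEA", "P"),
    record("GENE_A", "GO:0000003", "IDA", "F"),
    record("GENE_B", "GO:0000001", "IEA", "P"),
]

print(biological_process_terms(sample))
print(biological_process_terms(sample, exclude_iea=True))
```

With `exclude_iea=True`, GENE_B disappears entirely in this toy input, which mirrors the trade-off discussed above: excluding IEA annotations improves reliability but removes a large share of the annotated gene products.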
Computing gene similarities
A Vector Space Model (VSM) is used to compute similarity between pairs of gene products. VSMs are mainly used in information retrieval for computing the similarity between documents described as vectors of keywords. Recently, this method has been used to identify associative relations between terms in the GO. The transversal analysis uses the VSM to compute the similarity between gene products described as vectors of GO terms. A gene product is represented by a specific vector g as follows:
$$g = (t_1, t_2, \ldots, t_n)$$
where $t_i$ is the numeric value that the term $i$ takes on for this gene product and $n$ is the number of GO terms associated with the set of gene products. For example, $t_i = 0$ when there is no association between the GO term and the gene product in GOA. Since different terms have different importance for a gene product, a term weight is associated with each term. Lower weights are assigned to less important terms. In standard VSM, a common approach uses the idf method, in which the weight of a term is determined by the way this term occurs in the whole document collection (inverse document frequency). In the case of a gene product collection, we consider that a term is not representative of a gene product if it annotates most of the gene products in the collection. The term weight ($w_t$) is inversely proportional to the ratio of the number of gene products annotated by the term $t$ ($n_t$) to the total number of annotated gene products in the collection ($N$):
$$w_t = \mathrm{idf}_t = \log\frac{N}{n_t}$$
Once the term weights are determined, a gene product is represented by the following specific vector:
$$g = (w_1, w_2, \ldots, w_n)$$
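The weighting scheme can be sketched as follows: each term receives the weight log(N / n_t), and each gene product becomes a vector holding that weight for the terms annotating it and 0 elsewhere. The logarithm base is not stated in the text, so the natural logarithm is an assumption here; the gene and term names are toy placeholders and the function names are illustrative.

```python
import math


def idf_weights(annotations):
    """Compute log(N / n_t) for every term, N being the number of annotated gene products."""
    n_genes = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    # Natural logarithm assumed; the source does not specify the base.
    return {term: math.log(n_genes / n_t) for term, n_t in counts.items()}


def weighted_vectors(annotations):
    """Return the ordered term list and one weighted vector per gene product."""
    weights = idf_weights(annotations)
    terms = sorted(weights)
    vectors = {
        gene: [weights[t] if t in annotated else 0.0 for t in terms]
        for gene, annotated in annotations.items()
    }
    return terms, vectors


# Toy collection: t1 annotates every gene product, so it carries no weight.
annotations = {"A": {"t1", "t2"}, "B": {"t1", "t3"}, "C": {"t1"}}
terms, vectors = weighted_vectors(annotations)
print(terms)
print(vectors)
```

Note the boundary case this makes visible: a term that annotates every gene product in the collection gets weight log(1) = 0, so a gene product annotated only by such terms ends up with an all-zero vector and cannot be similar to anything.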
The semantic similarity between two gene products varies from 0 (no similarity) to 1 (complete similarity). Similarity is computed pairwise for all the gene products of the collection, which results in a half-matrix of gene product similarities. A threshold must be chosen for the inner product in order to select the pairs of gene products that present a high degree of similarity. This threshold has to 1) select the pairs of genes that have a high degree of similarity, 2) result in a high number of networks and 3) have biological significance: each network must contain enough gene products to offer new insight into cellular mechanisms. Each gene product involved in the half-matrix is associated with the expression cluster it belongs to. These expression clusters can result from 1) an expression clustering-based method that can be combined with our transversal approach (e.g. hierarchical clustering, k-means clustering or Self-Organizing Maps (SOM), which are the most widely used in the analysis of gene expression data; for review see ), 2) a specific database (such as Gene Expression Omnibus; ), or 3) the literature.
The gene product pairs that show a degree of similarity higher than the threshold are displayed as two nodes linked by an edge. This results in a set of functional networks. We use Graphviz, an open source software package developed at AT&T Labs, to visualize these networks. The node shape and color correspond to the expression cluster each gene belongs to. An edge between two nodes is associated with the degree of similarity between two gene products according to the given threshold. Each edge is typed by the GO terms shared by the two gene products. A biological profile is defined for each network; it corresponds to the most frequent terms typing the edges of a network. The network interpretation is first based on graph theory: for example, a network is called a clique (or complete graph) if each pair of nodes is joined by an edge. Applied to gene products, this graph property highlights a robust biological network. In addition, the biological profiles assist the evaluation of the resulting networks.
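The similarity formula itself is not reproduced in this text. A common choice in a VSM that yields values between 0 and 1 for non-negative weights is the normalized inner product (cosine), and the sketch below assumes that measure. It computes the half-matrix over all unordered pairs and keeps those above a threshold; the vectors and the threshold value are toy assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Normalized inner product of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold):
    """Return {(gene1, gene2): similarity} for pairs scoring above the threshold."""
    pairs = {}
    # combinations() visits each unordered pair once, i.e. the half-matrix.
    for g1, g2 in combinations(sorted(vectors), 2):
        score = cosine(vectors[g1], vectors[g2])
        if score > threshold:
            pairs[(g1, g2)] = score
    return pairs


# Toy weighted vectors over three terms (hypothetical values).
vectors = {"A": [1.0, 1.0, 0.0], "B": [1.0, 0.0, 0.0], "C": [0.0, 0.0, 2.0]}
print(similar_pairs(vectors, threshold=0.5))
```

Raising the threshold trades network size for confidence, which is the balance the three criteria above describe.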
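The network construction described above can be sketched with plain data structures: connected components give the networks, a clique test checks whether every pair of nodes is linked, the most frequent edge terms give the biological profile, and a small helper writes the graph in the Graphviz DOT language. The edge data, the node shapes and all function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter, defaultdict


def networks(edges):
    """Group gene products into connected components, given (gene1, gene2) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def is_clique(nodes, edges):
    """True when every pair of the given nodes is joined by an edge."""
    present = {frozenset(edge) for edge in edges}
    ordered = sorted(nodes)
    return all(
        frozenset((a, b)) in present
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
    )


def biological_profile(nodes, edge_terms, top=2):
    """Most frequent GO terms typing the edges inside one network."""
    counts = Counter()
    for (a, b), terms in edge_terms.items():
        if a in nodes and b in nodes:
            counts.update(terms)
    return [term for term, _ in counts.most_common(top)]


def to_dot(edge_terms, shape_of):
    """Serialize the networks as an undirected Graphviz DOT graph."""
    lines = ["graph transversal {"]
    for gene, shape in sorted(shape_of.items()):
        lines.append(f'  "{gene}" [shape={shape}];')
    for (a, b), terms in sorted(edge_terms.items()):
        label = "; ".join(sorted(terms))
        lines.append(f'  "{a}" -- "{b}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Toy edges typed by shared terms; shapes stand in for expression clusters.
edge_terms = {
    ("A", "B"): {"protein metabolism", "cellular biosynthesis"},
    ("A", "C"): {"protein metabolism"},
    ("B", "C"): {"protein metabolism"},
    ("D", "E"): {"lipid metabolism"},
}
shape_of = {"A": "box", "B": "ellipse", "C": "box", "D": "triangle", "E": "ellipse"}

for net in networks(edge_terms):
    print(sorted(net), is_clique(net, edge_terms), biological_profile(net, edge_terms, top=1))
print(to_dot(edge_terms, shape_of))
```

The DOT text can be saved to a file and rendered with the Graphviz command-line tools.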
Conceptual distance metric
Lev values are calculated for all the GO terms associated with the collection of gene products. The space of Lev values is divided into ten intervals. Each interval must contain only one value of Lev(t) in order to avoid the integration of hierarchical related terms in the vectors used to compute semantic similarity. The most informative level interval has to be selected to compute the semantic similarity between the genes. While deepest levels in the hierarchy contribute to a better characterization of gene products, each specific category does not appear to be significant because there are only few gene products associated with it.
The transversal analysis can be refined by performing several runs at various a posteriori levels of abstraction. Gene products that are associated with finer-grained GO terms are then grouped together under more general categories. In addition to retrieving more significant terms, such a posteriori levels can be used to focus on broader biological processes (e.g. ion transport rather than cation transport, anion transport, etc.).
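The regrouping under more general categories can be sketched by walking up the hierarchy from each annotated term and keeping the ancestors (or the term itself) that belong to the chosen a posteriori level. The tiny parent map below only encodes the ion transport example from the text; the gene names and function names are hypothetical, and the way the level is represented (an explicit set of terms) is a simplifying assumption.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from the given term."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def lift_annotations(annotations, parents, level_terms):
    """Replace each gene product's terms by their representatives at the chosen level."""
    lifted = {}
    for gene, terms in annotations.items():
        general = set()
        for term in terms:
            # A term already at the chosen level represents itself.
            general |= ({term} | ancestors(term, parents)) & level_terms
        lifted[gene] = general
    return lifted


# Toy fragment of the hierarchy (child -> set of parents).
parents = {
    "cation transport": {"ion transport"},
    "anion transport": {"ion transport"},
    "ion transport": {"transport"},
}
annotations = {"GENE_1": {"cation transport"}, "GENE_2": {"anion transport"}}

print(lift_annotations(annotations, parents, level_terms={"ion transport"}))
```

After lifting, the two toy gene products share the term ion transport and would therefore obtain a non-zero similarity in a second run, whereas at the finer level they shared nothing.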
We gratefully acknowledge Olivier Dameron for helpful discussions. This work was supported by a grant from the Region Bretagne (PRIR).
- Consortium GO: The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 2006, 34: D322-326. 10.1093/nar/gkj021.
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998, 95 (25): 14863-14868. 10.1073/pnas.95.25.14863.
- Sun H, Fang H, Chen T, Perkins R, Tong W: GOFFA: Gene Ontology For Functional Analysis - A FDA Gene Ontology Tool for Analysis of Genomic and Proteomic Data. BMC Bioinformatics. 2006, 7 Suppl 2: S23. 10.1186/1471-2105-7-S2-S23.
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005, 21 (18): 3587-3595. 10.1093/bioinformatics/bti565.
- Gibbons FD, Roth FP: Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res. 2002, 12 (10): 1574-1581. 10.1101/gr.397002.
- Chabalier J, Capponi C, Quentin Y, Fichant G: ISYMOD: a knowledge warehouse for the identification, assembly and analysis of bacterial integrated systems. Bioinformatics. 2005, 21 (7): 1246-1256. 10.1093/bioinformatics/bti137.
- Quentin Y, Chabalier J, Fichant G: Strategies for the identification, the assembly and the classification of integrated biological systems in completely sequenced genomes. Comput Chem. 2002, 26 (5): 447-457. 10.1016/S0097-8485(02)00007-4.
- Joseph P, Fichant G, Quentin Y, Denizot F: Regulatory relationship of two-component and ABC transport systems and clustering of their genes in the Bacillus/Clostridium group, suggest a functional link between them. J Mol Microbiol Biotechnol. 2002, 4 (5): 503-513.
- Rada R, Bicknell E: Ranking documents with a thesaurus. J Am Soc Inf Sci. 1989, 40 (5): 304-310. 10.1002/(SICI)1097-4571(198909)40:5<304::AID-ASI2>3.0.CO;2-6.
- Lin D: An information-theoretic definition of similarity. 15th International Conference on Machine Learning; Madison, WI. 1998.
- Jiang J, Conrath D: Semantic Similarity based on Corpus Statistics and Lexical Taxonomy. International Conference on Research in Computational Linguistics; Taiwan. 1997.
- Resnik P: Semantic Similarity in a Taxonomy: An Information-Based Measure and its Applications to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research. 1995, 95-130.
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003, 19 (10): 1275-1283. 10.1093/bioinformatics/btg153.
- Wang H, Azuaje F, Bodenreider O, Dopazo J: Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. IEEE2004 Symposium on Computational Intelligence in Bioinformatics and Computational Biology; La Jolla, CA, USA. 2004, 25-31.
- Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics. 2006, 7: 302. 10.1186/1471-2105-7-302.
- Chiang JH, Shin JW, Liu HH, Chin CL: GeneLibrarian: an effective gene-information summarization and visualization system. BMC Bioinformatics. 2006, 7: 392. 10.1186/1471-2105-7-392.
- Budanitsky A, Hirst G: Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. Workshop on WordNet and Other Lexical Resources, Second meeting of the North American Chapter of the Association for Computational Linguistics; Pittsburgh. 2001.
- Azuaje F, Wang H, Bodenreider O: Ontology-driven similarity approaches to supporting gene functional assessment. ISMB'2005 SIG meeting on Bio-ontologies. 2005.
- Baeza-Yates R, Ribeiro-Neto B: Modern information retrieval. 1999, Addison-Wesley, New York, Harlow, England.
- Bedrine-Ferran H, Le Meur N, Gicquel I, Le Cunff M, Soriano N, Guisle I, Mottier S, Monnier A, Teusan R, Fergelot P, Le Gall JY, Leger J, Mosser J: Transcriptome variations in human CaCo-2 cells: a model for enterocyte differentiation and its link to iron absorption. Genomics. 2004, 83 (5): 772-789. 10.1016/j.ygeno.2003.11.014.
- Transversal Approach. [http://www.ea3888.univ-rennes1.fr/TransversalApproach/]
- Wool IG: Extraribosomal functions of ribosomal proteins. Trends Biochem Sci. 1996, 21 (5): 164-165. 10.1016/0968-0004(96)20011-8.
- Yamamoto T: Molecular mechanism of monocyte predominant infiltration in chronic inflammation: mediation by a novel monocyte chemotactic factor, S19 ribosomal protein dimer. Pathol Int. 2000, 50 (11): 863-871. 10.1046/j.1440-1827.2000.01132.x.
- Brosnan ME, Brosnan JT: Renal arginine metabolism. J Nutr. 2004, 134 (10 Suppl): 2791S-2795S; discussion 2796S-2797S.
- Weiss MD, DeMarco V, Strauss DM, Samuelson DA, Lane ME, Neu J: Glutamine synthetase: a key enzyme for intestinal epithelial differentiation?. JPEN J Parenter Enteral Nutr. 1999, 23 (3): 140-146.
- Levy E, Mehran M, Seidman E: Caco-2 cells as a model for intestinal lipoprotein synthesis and secretion. Faseb J. 1995, 9 (8): 626-635.
- Mariadason JM, Arango D, Corner GA, Aranes MJ, Hotchkiss KA, Yang W, Augenlicht LH: A gene expression profile that defines colon cell maturation in vitro. Cancer Res. 2002, 62 (16): 4791-4804.
- Field FJ, Born E, Murthy S, Mathur SN: Regulation of sterol regulatory element-binding proteins by cholesterol flux in CaCo-2 cells. J Lipid Res. 2001, 42 (10): 1687-1698.
- Nakajima T, Iwaki K, Kodama T, Inazawa J, Emi M: Genomic structure and chromosomal mapping of the human site-1 protease (S1P) gene. J Hum Genet. 2000, 45 (4): 212-217. 10.1007/s100380070029.
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32 (Database issue): D277-80. 10.1093/nar/gkh063.
- Chabalier J, Garcelon N, Aubry M, Burgun A: A transversal approach to compute semantic similarity between genes. Workshop on Biomedical Ontologies and Text Processing - European Conference on Computational Biology (ECCB'2005); Madrid, Spain. 2005.
- Graphviz software. [http://www.graphviz.org]
- Harris DS, Slot JW, Geuze HJ, James DE: Polarized distribution of glucose transporter isoforms in Caco-2 cells. Proc Natl Acad Sci U S A. 1992, 89 (16): 7556-7560. 10.1073/pnas.89.16.7556.
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): R80. 10.1186/gb-2004-5-10-r80.
- Wolting C, McGlade CJ, Tritchler D: Cluster analysis of protein array results via similarity of Gene Ontology annotation. BMC Bioinformatics. 2006, 7: 338. 10.1186/1471-2105-7-338.
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004, 32 (Database issue): D262-6. 10.1093/nar/gkh021.
- GO slim. [http://www.geneontology.org/GO.slims.shtml]
- Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, Binns D, Apweiler R: An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics. 2005, 6 Suppl 1: S17. 10.1186/1471-2105-6-S1-S17.
- Agirre E, Rigau G: Word sense disambiguation using conceptual density. 15th International Conference on Computational Linguistics, COLING'96; Copenhagen, Denmark. 1996.
- Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001, 28 (1): 21-28. 10.1038/88213.
- Casbon J, Saqi M: Functional diversity within proteins superfamilies. Journal of Integrative Bioinformatics. 2006, 3 (2):
- Bodenreider O, Aubry M, Burgun A: Non-lexical approaches to identifying associative relations in the gene ontology. Pac Symp Biocomput. 2005, 91-102.
- Salton G, McGill M: Introduction to Modern Information Retrieval. 1983, New York: McGraw Hill Companies.
- Singhal A, Salton G: Automatic Text Browsing Using Vector Space Model. Fifth Dual-Use Technologies and Applications Conference; Utica/Rome, NY. 1995, 318-324.
- Gerstein M, Jansen R: The current excitement in bioinformatics-analysis of whole-genome expression data: how does it relate to protein structure and function?. Curr Opin Struct Biol. 2000, 10 (5): 574-584. 10.1016/S0959-440X(00)00134-2.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res. 2007, 35 (Database issue): D760-5. 10.1093/nar/gkl887.
- Mao X, Cai T, Olyarchuk JG, Wei L: Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics. 2005, 21 (19): 3787-3793. 10.1093/bioinformatics/bti430.
- Rigau G, Atserias J, Agirre E: Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation. 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics ACL/EACL'97; Madrid, Spain. 1997.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.