DOSim: An R package for similarity between diseases based on Disease Ontology
- Jiang Li†1,
- Binsheng Gong†1,
- Xi Chen1,
- Tao Liu1,
- Chao Wu1,
- Fan Zhang1,
- Chunquan Li1,
- Xiang Li1,
- Shaoqi Rao1 and
- Xia Li1
© Li et al; licensee BioMed Central Ltd. 2011
Received: 16 January 2011
Accepted: 29 June 2011
Published: 29 June 2011
The construction of the Disease Ontology (DO) has helped promote the investigation of diseases and disease risk factors. DO enables researchers to analyse disease similarity by adopting semantic similarity measures, and has expanded our understanding of the relationships between different diseases and our ability to classify them. Similarities between genes can likewise be analysed through their associations with similar diseases. As a result, disease heterogeneity is better understood and insights into the molecular pathogenesis of similar diseases have been gained. However, bioinformatics tools that provide easy and straightforward ways to use DO to study disease and gene similarity simultaneously are still required.
We have developed an R-based software package (DOSim) to compute the similarity between diseases and to measure the similarity between human genes in terms of diseases. DOSim incorporates a DO-based enrichment analysis function that can be used to explore the disease features of an independent gene set. A multilayered enrichment analysis function (GO and KEGG annotation) that helps users explore the biological meaning implied in a newly detected gene module is also part of the DOSim package. We used the disease similarity application to demonstrate the relationships between 128 different DO cancer terms. The hierarchical clustering of these 128 cancers showed modular characteristics. In another case study, we used the gene similarity application on 361 obesity-related genes. The results revealed the complex pathogenesis of obesity. In addition, the gene module detection and gene module multilayered annotation functions in DOSim, when applied to these 361 obesity-related genes, helped extend our understanding of the complex pathogenesis of obesity risk phenotypes and the heterogeneity of obesity-related diseases.
DOSim can be used to detect disease-driven gene modules, and to annotate the modules for functions and pathways. The DOSim package can also be used to visualise the DO structure. DOSim can reflect the modular characteristics of disease-related genes and promote our understanding of the complex pathogenesis of diseases. DOSim is available on the Comprehensive R Archive Network (CRAN) or at http://bioinfo.hrbmu.edu.cn/dosim.
The past several decades have seen a number of methods applied to the computation of similarities between diseases [1–4]. The early work used clinical phenotypes or diagnostic information. For example, Kalaria ascertained similarities between Alzheimer's disease and vascular dementia by studying the similarities between disease symptoms and pathological results. More recently, with the availability of large-scale knowledge bases such as the Online Mendelian Inheritance in Man (OMIM) and the Genetic Association Database (GAD), scientists have been able to explore the genetic similarity between diseases. In 2009, Liu et al. revealed similarities between diseases by combining both genetic factors (data from GAD) and environmental factors (data from Medical Subject Headings, MeSH) and, by mining for disease etiologies, created a new concept named the "etiome". Zhang and his colleagues used a text-based method to build a human disease phenotype network in which each disease was represented by a feature vector and the similarity between two diseases was calculated as the cosine of the angle between their corresponding feature vectors. However, little work has been done on applying ontology-based semantic similarity measures to diseases, which offers another way to analyse the relationships between diseases.
Understanding similarities between genes has a significant role to play in disease research. One hypothesis states that genes associated with similar diseases have similar functions; the greater the gene similarity, the higher the probability that the genes are associated with similar diseases. However, current methods to determine gene similarity rely on sequence similarity, gene expression profiles, Gene Ontology (GO) annotations or PubMed abstracts, all of which are derived from normal or partially abnormal conditions, and this separates gene similarity from disease similarity. Thus, a process to determine the similarities between genes in terms of diseases and to map gene similarities to disease similarities would help us better understand the mechanisms of complex diseases.
Measuring the similarity between diseases
Terms in DO include disease names and disease-related concepts. Exploring the similarity between them can help us to understand the relatedness between diseases. The past few years have seen an increase in the number of different measures used for the calculation of semantic similarity. Based on the semantic similarity measures for biomedical ontologies reviewed by Pesquita et al., and for general applicability, we implemented ten representative semantic similarity measures in DOSim: the Resnik measure, the Lin measure, the Jiang and Conrath measure (JC), the Relevance measure (Rel), the Graph Information Content measure (GIC), the Information Coefficient similarity measure (simIC), the Wang measure, the modified Resnik measure (CoutoResnik), the modified Lin measure (CoutoLin), and the modified Jiang and Conrath measure (CoutoJC). Except for the Wang measure, which is a hybrid measure, the other nine measures are based on information content (IC).
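To make the IC-based family concrete, the following is a minimal Python sketch of the Resnik and Lin similarities and the Jiang and Conrath distance on a toy ontology. It is not DOSim code: the term names, gene annotations and function names are hypothetical, and IC is estimated here from annotation frequency, which is one common choice rather than a confirmed detail of the package.

```python
import math

# Hypothetical toy ontology (child -> parents); these are not real DO terms.
PARENTS = {
    "root": [],
    "cancer": ["root"],
    "lung": ["cancer"],
    "breast": ["cancer"],
    "metabolic": ["root"],
}
# Hypothetical direct gene annotations for each term.
ANNOT = {
    "cancer": {"g4"},
    "lung": {"g1", "g2"},
    "breast": {"g3"},
    "metabolic": {"g5", "g6", "g7", "g8"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / n_t), where n_t counts genes annotated to t or a descendant."""
    genes_under = {t: set() for t in PARENTS}
    for term, genes in ANNOT.items():
        for anc in ancestors(term):
            genes_under[anc] |= genes
    total = len(genes_under["root"])
    return {t: math.log(total / len(g)) for t, g in genes_under.items() if g}


IC = information_content()


def mica_ic(a, b):
    """IC of the most informative common ancestor of a and b."""
    return max(IC[t] for t in ancestors(a) & ancestors(b))


def resnik(a, b):
    return mica_ic(a, b)


def lin(a, b):
    denom = IC[a] + IC[b]
    # Convention chosen for this sketch only: 0.0 when both terms have IC 0.
    return 2 * mica_ic(a, b) / denom if denom else 0.0


def jc_distance(a, b):
    """Jiang and Conrath distance; smaller values mean more similar terms."""
    return IC[a] + IC[b] - 2 * mica_ic(a, b)


print(resnik("lung", "breast"), lin("lung", "breast"), jc_distance("lung", "breast"))
```

The Jiang and Conrath measure is defined as a distance, and implementations differ in how they convert it to a similarity, so the sketch leaves it as a distance.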
In the Wang measure, $w_e$ is the semantic contribution factor of edge $e$ ($e \in E_A$). It is set between 0 and 1 according to the type of relationship, e.g., "is-a" or "part-of". In DO, there is only one type of relationship, defined as "is-a". In DOSim, we set $w_e$ to 0.7.
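As an illustration of how a single contribution factor propagates through an "is-a" hierarchy, here is a minimal Python sketch of the Wang measure as it is commonly described: each ancestor of a term receives a semantic value equal to the best product of edge weights on a path down to that term, and two terms are compared over their shared ancestors. The toy ontology and function names are hypothetical; only the value 0.7 comes from the text above.

```python
# Hypothetical toy ontology (child -> parents); these are not real DO terms.
PARENTS = {
    "root": [],
    "cancer": ["root"],
    "lung": ["cancer"],
    "breast": ["cancer"],
}
W_E = 0.7  # contribution factor for "is-a" edges, as stated in the text


def semantic_values(term, w=W_E):
    """S-values of the term and its ancestors: S(term) = 1, S(t) = max(w * S(child))."""
    s_values = {term: 1.0}
    queue = [term]
    while queue:
        current = queue.pop()
        for parent in PARENTS[current]:
            candidate = w * s_values[current]
            # Keep the largest contribution reaching each ancestor.
            if candidate > s_values.get(parent, 0.0):
                s_values[parent] = candidate
                queue.append(parent)
    return s_values


def wang_similarity(a, b, w=W_E):
    """Shared semantic value of a and b divided by their total semantic value."""
    s_a, s_b = semantic_values(a, w), semantic_values(b, w)
    shared = set(s_a) & set(s_b)
    numerator = sum(s_a[t] + s_b[t] for t in shared)
    return numerator / (sum(s_a.values()) + sum(s_b.values()))


print(wang_similarity("lung", "breast"))
```

Because the measure depends only on the graph structure and the edge weight, it needs no annotation corpus, unlike the IC-based measures.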
Measuring the similarity between human genes in terms of diseases
In the DOSim package, the similarity between two genes is calculated from the similarity of their DO term annotation groups. Each gene is represented by its set of direct DO term annotations, and semantic similarity is calculated between terms in one set and terms in the other (using one of the measures described above). Some methods consider every pairwise combination of terms from the two sets, while others consider only the best-matching pair for each term. Five different methods are implemented in DOSim: the arithmetic maximum and average of the pairwise similarities between the two groups of DO terms describing the two genes (Max, Mean), the maximum and average of the similarities for the two directional comparisons of the similarity matrix S of the two genes (funSimMax, funSimAvg), and the best-match average approach (BMA), which considers the contributions from the semantically similar terms that annotate the two genes (Formula 23).
For a set of genes $G = (g_1, g_2, \ldots, g_n)$ of size $n$, the similarity matrix for these genes is defined as $Sim = [Sim_{ij}]_{n \times n}$, where $Sim_{ij}$ is the similarity between genes $g_i$ and $g_j$ derived by any of the five methods defined above.
In DOSim, there are a total of fifty optional semantic similarity measures for genes, which are combinations of the ten semantic similarity measures for term pairs and the five similarity methods mentioned above.
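The five combination methods can be sketched in a few lines of Python. The sketch assumes the term-by-term similarity matrix S has already been computed (rows are the DO terms of the first gene, columns those of the second) and follows the usual definitions of these methods from the literature; the function name and the example matrix are hypothetical and are not taken from DOSim.

```python
def gene_similarity(S, method="BMA"):
    """Combine a term-by-term similarity matrix S into one gene-level score."""
    row_best = [max(row) for row in S]        # best match for each term of gene 1
    col_best = [max(col) for col in zip(*S)]  # best match for each term of gene 2
    row_score = sum(row_best) / len(row_best)
    col_score = sum(col_best) / len(col_best)
    if method == "Max":
        return max(row_best)
    if method == "Mean":
        values = [v for row in S for v in row]
        return sum(values) / len(values)
    if method == "funSimMax":
        return max(row_score, col_score)
    if method == "funSimAvg":
        return (row_score + col_score) / 2
    if method == "BMA":
        return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))
    raise ValueError("unknown method: " + method)


# Hypothetical similarities: gene 1 has three DO terms, gene 2 has two.
S = [
    [0.9, 0.2],
    [0.1, 0.6],
    [0.3, 0.4],
]
for name in ("Max", "Mean", "funSimMax", "funSimAvg", "BMA"):
    print(name, round(gene_similarity(S, name), 4))
```

Max and Mean use every pairwise value, whereas funSimMax, funSimAvg and BMA use only the best match for each term, which makes them less sensitive to genes annotated with many unrelated terms.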
Conducting DO enrichment analysis
In the enrichment formula, $N$ is the total number of human genes in the genome, $k$ is the size of the gene set of interest, and $\binom{N}{k}$ is the number of combinations of the $N$ genes taken $k$ at a time, which is equal to $\frac{N!}{k!\,(N-k)!}$.
Compared with FunDO, which uses a small set of DO terms (DOLite), DOSim selects the DO terms that satisfy two criteria for enrichment analysis, with the aim of uncovering more biological results. The first criterion is that the term should be annotated by at least n genes, and the second is that the term should be beneath a depth m in the DAG of DO, where n and m can be set by users when running the DO enrichment analysis.
In the DOSim package, the DOEnrichment function carries out the DO enrichment analysis; the input is a list of Entrez gene IDs. The filter and layer parameters correspond to the two criteria mentioned above and control which terms are analysed: a term is included only if it is annotated by at least 'filter' genes and lies beneath the 'layer' depth in the DAG of DO.
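The following Python sketch shows one way such an enrichment test could be computed, assuming the usual one-sided hypergeometric p-value. It is not the DOEnrichment implementation: the function names, the symbols M (genes annotated to a term) and x (overlap with the gene set), the toy data, and the reading of "beneath depth m" as depth >= layer are all assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(N, M, k, x):
    """P(overlap >= x) when k genes are drawn from N, of which M carry the term."""
    upper = min(k, M)
    return sum(comb(M, i) * comb(N - M, k - i) for i in range(x, upper + 1)) / comb(N, k)


def do_enrichment(gene_set, term2genes, term_depth, universe, filter_size=2, layer=1):
    """Test each term that passes the size and depth criteria (names hypothetical)."""
    gene_set = set(gene_set) & set(universe)
    results = {}
    for term, genes in term2genes.items():
        genes = set(genes) & set(universe)
        # Assumed reading of the two criteria: enough genes, deep enough in the DAG.
        if len(genes) < filter_size or term_depth[term] < layer:
            continue
        overlap = len(gene_set & genes)
        results[term] = hypergeom_pvalue(len(universe), len(genes), len(gene_set), overlap)
    return results


# Hypothetical data: ten genes, two terms, one of which is too small to be tested.
universe = ["g%d" % i for i in range(1, 11)]
term2genes = {"termA": {"g1", "g2", "g3", "g4"}, "termB": {"g5"}}
term_depth = {"termA": 2, "termB": 3}
print(do_enrichment({"g1", "g2", "g9"}, term2genes, term_depth, universe))
```

In practice the resulting p-values would usually be corrected for multiple testing, which this sketch omits.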
Detecting and annotating DO-directed gene modules
A gene module is a group of highly correlated genes. In DOSim, gene modules can be detected as follows: after the gene similarity matrix for a gene set is constructed, hierarchical clustering is performed using the standard R function hclust and one of three branch cutting methods is applied (one constant-height cutting method and two dynamic branch cutting methods are embedded in our package).
The DOSim package incorporates multilayered enrichment analysis (GO and KEGG annotation) to explore the biological meaning of the detected gene modules. The GO annotations are conducted using GOSim and the KEGG annotations are generated using SubpathwayMiner. The input for the GO and KEGG annotations is a list of Entrez gene IDs, the statistical test used for each annotation database is the hypergeometric test, and the output for each annotation database is the list of enriched terms with p-values.
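The constant-height variant of this procedure can be illustrated with a small pure-Python sketch. It substitutes a hand-written average-linkage clustering for R's hclust, converts similarity to distance as 1 - similarity, and stops merging once the next merge would exceed the cut height; the linkage choice, the distance conversion and all names are assumptions, and the dynamic branch cutting methods are not covered.

```python
def detect_modules(sim, height):
    """Average-linkage agglomerative clustering cut at a constant height."""
    clusters = [[i] for i in range(len(sim))]

    def distance(a, b):
        # Average of (1 - similarity) over all cross-cluster gene pairs.
        return sum(1.0 - sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(
            (distance(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if d > height:
            break  # every remaining merge lies above the cut height
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical gene similarity matrix: genes 0 and 1 are alike, as are genes 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(detect_modules(sim, height=0.5))
```

A lower cut height yields more, tighter modules; a height of 1.0 on this example merges all four genes into one module.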
Describing and visualizing DO structures and terms
DO is a collection of terms associated with human diseases, organised in a DAG (Figure 1). DOSim provides utilities to visualise the DO structure, so users need not turn to other tools (e.g., OBO-Edit). Specifically, the hierarchical structure of DO terms can be represented as a graphNEL object, and the getDOGraph function in DOSim can be used to fetch the DO graph with specified DO terms at its leaves. For a given DO term, DOSim provides a series of functions to extract related terms (e.g., parent and child terms).
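The idea of a graph "with specified terms at its leaves" can be sketched as follows: starting from the chosen terms, collect every ancestor and keep the "is-a" edges among the collected nodes. This Python sketch is only an analogue of what getDOGraph is described as doing; the toy ontology and the function names are hypothetical and it does not produce a graphNEL object.

```python
# Hypothetical toy ontology (child -> parents); these are not real DO terms.
PARENTS = {
    "root": [],
    "cancer": ["root"],
    "lung": ["cancer"],
    "breast": ["cancer"],
    "metabolic": ["root"],
}


def induced_graph(leaves):
    """Nodes and (child, parent) edges of the sub-DAG spanned by the leaves."""
    nodes, stack = set(leaves), list(leaves)
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in nodes:
                nodes.add(parent)
                stack.append(parent)
    edges = sorted((child, parent) for child in nodes for parent in PARENTS[child])
    return nodes, edges


def parent_terms(term):
    return sorted(PARENTS[term])


def child_terms(term):
    return sorted(child for child, parents in PARENTS.items() if term in parents)


nodes, edges = induced_graph(["lung", "breast"])
print(sorted(nodes))
print(edges)
print(child_terms("cancer"), parent_terms("lung"))
```

Terms outside the ancestry of the chosen leaves ("metabolic" in this toy example) are left out of the induced graph.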
The effect of different measures on the computation of gene similarity
Application on disease similarity
We investigated the relationships between different kinds of cancers using disease similarities derived from DOSim. First, 128 cancer DO terms were obtained by using "cancer" as the key word to search all DO term names (excluding the DO term "DOID:162, cancer"). Then, we used the getTermSim function to obtain the pairwise similarities using the Wang measure (chosen here as an example; users can choose any of the other measures in their own applications).
We also constructed the DO graph with these 128 cancers as leaves (Additional file 2), which finally contained 398 disease DO terms. We found that, as expected, diseases in the same module formed a hierarchical structure in the DO graph, as illustrated in Figure S1. For example, the module marked brown contained 7 diseases, of which "cancer of urinary tract" (DOID:3996) is the ancestral node of the other 6 diseases. However, the observed correlation between "germ cell cancer" (DOID:2994) and the largest module, which has a size of 22 (Figure 4), does not correspond to any direct link in the DO graph. Again, the network representation in Figure 4 provided additional insights for our analysis.
Application on gene similarity
Table 1 Gene modules of the obesity-related genes
Representative GO annotation§
Representative KEGG annotation§
cholesterol homeostasis; high-density lipoprotein particle remodelling; triglyceride catabolic process
Insulin signaling pathway; Type II diabetes mellitus
Pyruvate metabolism; Galactose metabolism;
feeding behavior; photoreceptor cell maintenance
Neuroactive ligand-receptor interaction; Circadian rhythm - mammal;
response to estrogen stimulus; response to cytokine stimulus; cell aging
Pathways in cancer; Colorectal cancer; Endometrial cancer;
response to lipopolysaccharide; response to glucocorticoid stimulus
Cytokine-cytokine receptor interaction; Toll-like receptor signaling pathway;
positive regulation of phosphoinositide 3-kinase cascade; positive regulation of cholesterol esterification
Renin-angiotensin system; Prostate cancer
Insulin signaling pathway
blood coagulation; STAT protein nuclear translocation
Complement and coagulation cascades; Regulation of actin cytoskeleton
response to interleukin-1; response to glucocorticoid stimulus
Hematopoietic cell lineage; Cytokine-cytokine receptor interaction
When the complete GO and KEGG annotations of these ten different gene modules were analysed (Additional file 3), we found different enriched biological functions and pathways for each module, indicating the complex pathogenesis of obesity. For example, the KEGG annotations of one of the clusters (M4) (Table 1) indicated that obesity is a factor that may lead to various cancers (e.g., colorectal cancer and endometrial cancer) and that obesity may also be related to many signalling pathways (e.g., ErbB signalling pathway and Jak-STAT signalling pathway). However, the KEGG annotations of another cluster (M2) suggested that obesity may either affect the metabolism of many molecules or that the dysfunctional metabolism of these molecules may lead to obesity (e.g., pyruvate metabolism and galactose metabolism). Similarly, the GO annotations of cluster M1 implied that obesity is related to biological processes involving cholesterol, lipoprotein and triglyceride (e.g., cholesterol homeostasis, reverse cholesterol transport, high-density lipoprotein particle remodelling and triglyceride catabolic process), while the GO annotations of cluster M3 suggested that obesity may be associated with eating habits (e.g., feeding behavior and drinking behavior). Both the GO and KEGG annotations of cluster M8 indicated that obesity is related to coagulation (blood coagulation in GO; complement and coagulation cascades in KEGG). These multilayered annotations demonstrated the complex pathogenesis of obesity and suggested that the genes in the different gene modules could be potential drug targets for the corresponding diseases caused by obesity.
The DOSim package offers an easy and straightforward way to study disease similarity and gene similarity simultaneously in the DO. Additionally, other utilities implemented in DOSim, such as the functions for gene module detection and gene module multilayered annotation, make the DO easier for researchers to apply. The two case studies presented here highlight the usefulness of DOSim in a real-life scenario. We also provide Additional file 4, which contains all the R scripts needed to reproduce the two case studies.
The DOSim package advances the use of DO by integrating information-theoretic similarity concepts for diseases and deriving disease similarity measures for genes in the powerful R system. Compared with the few existing bioinformatics tools for DO, e.g., FunDO, which explores the disease information implied in a gene set by enrichment analysis, DOSim focuses on the computation of disease-disease and gene-gene similarities. Other utilities, such as the functions for gene module detection and gene module multilayered annotation, should help promote a better understanding of the complex pathogenesis of some disease risk phenotypes and the heterogeneity of some diseases. DOSim is available on the Comprehensive R Archive Network (CRAN) project or through http://bioinfo.hrbmu.edu.cn/dosim.
Availability and requirements
Project name: DOSim
Project home page: http://bioinfo.hrbmu.edu.cn/dosim
Operating system(s): platform independent
Programming language: R
Other requirements: none
Acknowledgements and Funding
This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 30871394, 61073136 and 91029717) and the Science Foundation of Heilongjiang Province (Grant Nos. ZD200816-01, JC200711, 2005-39, 1155H012, 11551232 and YJSCX2007-0195HLJ).
- Kalaria R: Similarities between Alzheimer's disease and vascular dementia. J Neurol Sci 2002, 203–204: 29–34.
- Hu G, Agarwal P: Human disease-drug network based on genomic expression profiles. PLoS One 2009, 4(8):e6536. doi:10.1371/journal.pone.0006536
- Wang F, Syeda-Mahmood T, Beymer D: Finding Disease Similarity by Combining ECG with Heart Auscultation Sound. Computers in Cardiology 2007, 261–264.
- Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabasi AL: The human disease network. Proc Natl Acad Sci USA 2007, 104(21):8685–8690. doi:10.1073/pnas.0701361104
- McKusick VA: Mendelian Inheritance in Man and its online version, OMIM. Am J Hum Genet 2007, 80(4):588–604. doi:10.1086/514346
- Becker KG, Barnes KC, Bright TJ, Wang SA: The genetic association database. Nat Genet 2004, 36(5):431–432. doi:10.1038/ng0504-431
- Liu YI, Wise PH, Butte AJ: The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics 2009, 10(Suppl 2):S14. doi:10.1186/1471-2105-10-S2-S14
- Fowler J, Kouramajian V, Maram S, Devadhar V: Automated MeSH indexing of the World-Wide Web. Proc Annu Symp Comput Appl Med Care 1995, 893–897.
- Zhang SH, Wu C, Li X, Chen X, Jiang W, Gong BS, Li J, Yan YQ: From phenotype to gene: detecting disease-specific gene functional modules via a text-based human disease phenotype network construction. FEBS Lett 2010, 584(16):3635–3643. doi:10.1016/j.febslet.2010.07.038
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database issue):D258–261.
- Warren A, Kibbe J, Wolf W, Smith M, Zhu L, Lin S, Chisholm R: Disease Ontology. 2006.
- Osborne J, Flatow J, Holko M, Lin S, Kibbe W, Zhu L, Danila M, Feng G, Chisholm R: Annotating the human genome with Disease Ontology. BMC Genomics 2009, 10(Suppl 1):S6. doi:10.1186/1471-2164-10-S1-S6
- Du P, Feng G, Flatow J, Song J, Holko M, Kibbe WA, Lin SM: From disease ontology to disease-ontology lite: statistical methods to adapt a general-purpose ontology for the test of gene-ontology associations. Bioinformatics 2009, 25(12):i63–68. doi:10.1093/bioinformatics/btp193
- Pesquita C, Faria D, Falcão AO, Lord P, Couto FM: Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 2009, 5(7):e1000443. doi:10.1371/journal.pcbi.1000443
- Resnik P: Using Information Content to Evaluate Semantic Similarity in a Taxonomy. Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal 1995, 1: 448–453.
- Lin D: An Information-Theoretic Definition of Similarity. ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning 1998, 296–304.
- Jiang J, Conrath D: Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics, Taiwan 1998.
- A. Schlicker FD: A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics 2006.
- C. Pesquita DF: Evaluating GO-based Semantic Similarity Measures. In: Proc 10th Annual Bio-Ontologies Meeting 2007, 37–40.
- A. Feltus B, Li JW: Effectively Integrating Information Content and Structural Relationship to Improve the GO-based Similarity Measure Between Proteins. BMC Bioinformatics 2009.
- James Z, Wang ZD: A new method to measure the semantic similarity of GO terms. Bioinformatics 2007, 1274–1281.
- Couto F, Silva M, Coutinho P: Semantic Similarity over the Gene Ontology: Family Correlation and Selecting Disjunctive Ancestors. Conference in Information and Knowledge Management 2005.
- Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275–1283. doi:10.1093/bioinformatics/btg153
- Langfelder P, Zhang B, Horvath S: Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics 2008, 24(5):719–720. doi:10.1093/bioinformatics/btm563
- Frohlich H, Speer N, Poustka A, Beißbarth T: GOSim: an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics 2007, 8(1):166. doi:10.1186/1471-2105-8-166
- Li C, Li X, Miao Y, Wang Q, Jiang W, Xu C, Li J, Han J, Zhang F, Gong B, et al.: SubpathwayMiner: a software package for flexible identification of pathways. Nucleic Acids Res 2009, 37(19):e131. doi:10.1093/nar/gkp667
- Tokoro Y: Cytology of malignant lymphoma. Rinsho Byori 2010, 58(11):1113–1120.
- Iannitto E, Tripodo C: How I diagnose and treat splenic lymphomas. Blood 2010, 117(9):2585–2595.
- Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, Workman C, Christmas R, Avila-Campilo I, Creech M, Gross B, et al.: Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2007, 2(10):2366–2382. doi:10.1038/nprot.2007.324
- Haslam DW, James WP: Obesity. Lancet 2005, 366(9492):1197–1209. doi:10.1016/S0140-6736(05)67483-1
- Yu W, Clyne M, Khoury MJ, Gwinn M: Phenopedia and Genopedia: disease-centered and gene-centered views of the evolving knowledge of human genetic associations. Bioinformatics 2009, 26(1):145–146.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.