Inherited disorder phenotypes: controlled annotation and statistical analysis for knowledge mining from gene lists
© Masseroli et al; licensee BioMed Central Ltd 2005
Published: 1 December 2005
Analysis of inherited diseases and their associated phenotypes is of great importance for gaining knowledge of the underlying genetic interactions, and could ultimately give clinically useful insights into disease processes, including complex diseases influenced by multiple genetic loci. Nevertheless, few computational contributions have been proposed to date for this purpose, mainly because controlled clinical information that is easily accessible and structured for computational genome-wide analyses is lacking. To enable phenotype analyses of genes related to inherited disorders, we implemented new modules within GFINDer (http://www.bioinformatics.polimi.it/GFINDer/), a Web system we previously developed that dynamically aggregates functional annotations of user-uploaded gene lists and supports their statistical analysis and mining.
New GFINDer modules allow annotating large numbers of user classified biomolecular sequence identifiers with morbidity and clinical information, classifying them according to genetic disease phenotypes and their locations of occurrence, and statistically analyzing the obtained classifications. To achieve this we exploited, normalized and structured the information present in textual form in the Clinical Synopsis sections of the Online Mendelian Inheritance in Man (OMIM) databank. Such valuable information delineates numerous signs and symptoms accompanying many genetic diseases and it is divided into phenotype location categories, either by organ system or type of finding.
Supporting phenotype analyses of inherited diseases and biomolecular functional evaluations, GFINDer facilitates a genomic approach to the understanding of fundamental biological processes and complex cellular mechanisms underlying patho-physiological phenotypes.
Understanding clinical phenotypes through their corresponding genotypes is paramount to unveiling inherited alterations that can lead to pathological processes and syndromes. However, such comprehension can be very difficult for complex disorders, which frequently present different clinical phenotypes that may result from interactions among multiple and potentially unknown genetic loci. Moreover, considerably different genetic alterations may cause very similar or even identical phenotypes [1, 2]. Thus, complex multivariate analyses of the molecular processes underlying phenotypically similar disorders are required to gain insights into the composite gene and protein interactions. To perform such analyses computationally, a large amount of structured information and a few controlled vocabularies describing biological processes and molecular functions are available [3–6].
Nevertheless, useful clinical information related to genetic diseases is generally not easily accessible and is mainly included in free text descriptions. Hence, it is not suitably organized to be used in computational analyses. This limited availability of controlled structured phenotypic information is hampering the development of effective analytical contributions in the field.
Recently, some tools have been developed to extract genetic and disease information from free text [7–11]. These, which are based on term co-occurrence and association rules or Natural Language Processing techniques, automatically extract sets of genetic and phenotypic related terms. However, due to complexity and variety of clinical biomolecular and genomic descriptions they inherently present extraction errors, with different degrees of precision and recall. Therefore, extracted information should be revised before applying it in subsequent analyses.
In some medical areas, such as oncology, curated phenotypic information on complex genetic disorders is being collected in structured format [12–14]. Nevertheless, such data are currently available for only a few classes of diseases, and not yet in sufficient quantity for computational genome-wide analyses.
To effectively exploit the valuable information in the OMIM Clinical Synopsis section, we first extracted the names of phenotypes and of their locations and normalized them to create a term vocabulary describing phenotype and phenotype location categories. Then, we hierarchically structured these category descriptions according to increasing levels of detail (for phenotypes) or topological levels (for locations). Finally, within GFINDer, a Web system we previously developed for analyzing dynamically aggregated annotations of user-uploaded gene lists, we used the normalized and structured Clinical Synopsis vocabularies as the basis for new GFINDer modules specifically devoted to the analysis of genes related to inherited disorders. These new modules allow annotating large numbers of user-classified biomolecular sequence identifiers with morbidity and clinical information, classifying them according to genetic disease phenotypes and their locations of occurrence, and statistically analyzing the obtained classifications.
Normalization and structuring of genetic disease phenotype and location terms
Hierarchical structure of some of the phenotype categories considered in GFINDer, as derived from the corresponding phenotype descriptions provided by the OMIM databank. Entries surviving from the original layout: "present at birth"; "straightening with time"; "following ingestion of fava beans"; "in some patients".
Similarly, the 94 unique normalized phenotype locations were hierarchically structured in three topological levels according to their anatomical organization. The broadest of these levels, which includes organ systems or sites, comprised 36 locations.
All the above normalizations and hierarchical structuring produced a total of 33,338 phenotype location entries and 49,072 specific phenotype entries for the available 4,570 OMIM entries with a Clinical Synopsis, which were annotated to 11,433 distinct genes.
Analysis of inherited disorder phenotypes
Validation of implemented application
To assess the capabilities of the implemented GFINDer Genetic Disorders modules, we used them to evaluate a set of 1,046 human clones spotted on the 7734-1 or 7736-1 Clontech microarrays, focused on the cardiovascular system (522 clones) and neurobiology (524 clones), respectively. Using the GFINDer Annotation module, we found that these clones corresponded to 935 distinct genes. Of these, 271 (250 autosomal and 21 X-linked) were involved in 462 inherited diseases, and 122 (97 autosomal and 25 X-linked) were associated with 679 different phenotypes in 63 locations. We then used the GFINDer Exploration and Statistics Genetic Disorders modules to evaluate whether genes associated with specific inherited disease phenotypes or locations were significantly represented among the cardiovascular system related genes (CARDIO) versus the neurobiology related genes (NEURO).
With the Exploration module we observed the distribution of phenotypes and phenotype locations (Figure 2) within the two considered CARDIO and NEURO classes of genes. Then, using the Statistics module we evaluated the phenotype locations most represented in the CARDIO versus NEURO class. We concentrated only on genes with phenotype location annotations and on location categories associated with at least two of the considered genes. Statistical analysis correctly selected phenotype locations related to the appropriate class of considered genes. In fact, the significant selected locations included "Cardiovascular" (p = 0.00074), "Heme" (p = 0.02206), "Heart" (p = 0.03165) and "Cardiac" (p = 0.03562) categories for the CARDIO class, and "Neurologic" (p < 0.00001), "Central nervous system" (p = 0.0002), "Behavioural/Psychiatric manifestations" (p = 0.00347), "Peripheral nervous system" (p = 0.01702) categories for the NEURO class.
Finally, we analyzed the phenotypes most represented in the CARDIO versus NEURO class. Focusing only on genes with phenotype annotations and on phenotype categories associated with at least two of the considered genes, GFINDer statistical analysis properly highlighted as most relevant in each gene class signs and symptoms logically pertaining to that class (Figure 3). In fact, apart from inheritance pattern phenotypes, the CARDIO gene class included "Heart failure congestive" (p = 0.01402) and "Hypertension" (p = 0.02271) phenotypes, whereas the NEURO class included "Dementia" (p = 0.00159), "Myoclonus" (p = 0.00677), "Dysarthria" (p = 0.00884) and "Mental retardation" (p = 0.01557) phenotypes.
Obtained results demonstrate validity of the approach for the analysis of genetic disorder phenotypes, locations and related genes that we developed, implemented and made available within the GFINDer Web system.
Our efforts to derive from the OMIM entries a controlled vocabulary of phenotype locations and descriptions enabled us to normalize and structure the valuable OMIM phenotypic data according to the obtained vocabulary and make them suitable for computational use. Although detailed phenotype descriptions could be further homogenized and standardized, the subdivision into hierarchical levels of detail that we performed allows specific phenotypes to be grouped according to their common general traits without losing their specific characteristics. For example, "Mental retardation, moderate" and "Mental retardation, nonspecific" can both be considered generally as "Mental retardation" and, at the same time, be treated as different types of mental defects. This makes it possible to adjust the granularity of the analysis when searching for phenotypic traits shared among multiple diseases or genotypes. It also ensures more significant and clearer results when categorical statistical analyses are performed at coarser levels of detail. This feature, which is inherent to the hierarchical structure and hence also applies to the defined phenotype location hierarchy, is exploited in the new GFINDer Genetic Disorders modules implemented for the study of genes related to genetic disorders.
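The grouping by level of detail described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a comma separates the general term from each more specific qualifier, as in the "Mental retardation, moderate" example; the gene identifiers and the function names `split_levels` and `group_at_level` are hypothetical and are not part of GFINDer.

```python
from collections import defaultdict


def split_levels(description):
    """Split a normalized phenotype description into hierarchical levels.

    Assumption: a comma separates the general term from each more
    specific qualifier, e.g. "Mental retardation, moderate".
    """
    return [part.strip() for part in description.split(",") if part.strip()]


def group_at_level(gene_to_phenotypes, level):
    """Group genes by phenotype description truncated to `level` levels."""
    groups = defaultdict(set)
    for gene, descriptions in gene_to_phenotypes.items():
        for description in descriptions:
            key = ", ".join(split_levels(description)[:level])
            groups[key].add(gene)
    return dict(groups)


# Hypothetical gene identifiers, for illustration only.
annotations = {
    "GENE_A": ["Mental retardation, moderate"],
    "GENE_B": ["Mental retardation, nonspecific"],
    "GENE_C": ["Hypertension"],
}

coarse = group_at_level(annotations, level=1)  # general traits only
fine = group_at_level(annotations, level=2)    # general trait plus qualifier
```

At level 1 the two mental retardation variants fall into one category shared by two genes, whereas at level 2 they remain distinct, which is the trade-off between statistical support and specificity discussed in the text.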
In the Exploration module, the user can select the level of detail of the phenotype description, or the topological level of the location, at which to explore the genetic disorder phenotypic annotations associated with a considered set of genes, or explore all levels at the same time (Figure 2). In the Statistics module, consecutive statistical tests are executed on each categorical annotation independently of its level of detail. Analysis results are then shown by listing each tested categorical annotation with its hierarchical level and the obtained p-value (Figure 3). This simultaneously provides a comprehensive view of the statistical significance of all considered annotations and clearly highlights the phenotypic characteristics with the lowest p-value within each of the considered user-defined classes of genes, also specifying their granularity level. Validation results showed that this is performed correctly even when genes in different considered classes are associated with the same genetic disorder phenotypes, as happens for disorders that may have both a cardiovascular and a neurological etiology. In these cases, although the obtained p-values do not reach statistical significance, lower p-values properly indicate more relevant phenotypic characteristics (Figure 3).
GFINDer, which is freely available on-line for non-profit use at http://www.bioinformatics.polimi.it/GFINDer/, is hence a unique valuable tool that provides support for a phenotypic taxonomy of inherited diseases. Although several tools are available for the analysis of gene annotations according to the Gene Ontology and few other controlled vocabularies, at present to our knowledge none supports phenotype analyses.
Our normalization and structuring of the valuable phenotypic information that OMIM offers generated a controlled phenotypic vocabulary suitable for computational purposes. As our validation demonstrated, its use within the newly implemented GFINDer modules allows effective phenotypic analyses of genes related to inherited disorders. The new GFINDer functionalities can hence help in better interpreting high-throughput gene lists and in unveiling new biomedical knowledge about the considered genes. Thus, they can facilitate a genomic approach to understanding, through the corresponding genotypes, the fundamental biological processes and complex cellular mechanisms underlying patho-physiological phenotypes.
Normalization and structuring of genetic disease phenotype and location terms
From the omim.txt file, which contains the entire free text of the OMIM databank and is freely available from the OMIM FTP site, we used standard text-parsing procedures to extract the phenotype descriptions and phenotype location names included in its Clinical Synopsis sections. Then, the isolated names and descriptions were visually inspected and normalized. A unique term was assigned to each location synonym or misspelled name. Phenotype descriptions were corrected for typographical errors, and different descriptions of the same sign or symptom were unified under the most recurrent or most correct description. Furthermore, when phenotype descriptions included general illustrative terms together with more specific traits, we subdivided them into hierarchical levels according to their increasing degree of detail. We hierarchically structured phenotype locations in a similar way. In the available OMIM Clinical Synopsis sections, phenotype locations are partially and inconsistently structured according to their topological organization (Figure 1). After normalizing the location terms, we homogenized the provided structure and organized it in hierarchical levels reflecting the anatomical organization of the described location categories.
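The synonym normalization step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the synonym table, its entries (including the misspelling), and the function names are hypothetical, and the actual vocabulary was curated by visual inspection rather than by this kind of lookup alone.

```python
def normalize_term(raw, synonyms):
    """Return the canonical term for a raw name, or None if it is unknown.

    Matching ignores case and collapses runs of whitespace.
    """
    key = " ".join(raw.lower().split())
    return synonyms.get(key)


def normalize_all(raw_terms, synonyms):
    """Split raw terms into normalized terms and terms needing manual review."""
    normalized, needs_review = [], []
    for raw in raw_terms:
        term = normalize_term(raw, synonyms)
        if term is None:
            needs_review.append(raw)
        else:
            normalized.append(term)
    return normalized, needs_review


# Hypothetical synonym table: every variant maps to one unique term.
SYNONYMS = {
    "cardiovascular": "Cardiovascular",
    "cardio-vascular": "Cardiovascular",
    "cardiovasular": "Cardiovascular",  # hypothetical misspelling
    "neurologic": "Neurologic",
    "neuro": "Neurologic",
}

terms, review = normalize_all(
    ["Cardio-Vascular", "  NEURO ", "Cardiovasular", "Skel"], SYNONYMS
)
```

Terms that are absent from the table are returned separately instead of being guessed, mirroring the manual review that the normalization relied on.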
Implemented GFINDer architecture for phenotype annotation analyses
The GFINDer Web system is implemented in a three-tier architecture based on a multi-database structure. In the first tier, the data tier, a MySQL DBMS manages all considered genomic annotations, which are stored in different relational databases. To this tier we added a specifically designed relational database in which we hierarchically structured the information about genetic disease phenotypes by exploiting the multi-granular descriptions of OMIM Clinical Synopsis phenotypes and the topological descriptions of their locations. To associate an inherited disorder phenotype with the involved genes or genetic loci, if any, we considered the MIM codes associated with a gene, as provided by the Entrez Gene database. Using the Java programming language, we implemented procedures that automatically import into the GFINDer data tier, and keep updated, genetic disorder phenotype information and the corresponding gene annotations as soon as new releases become available in the OMIM and Entrez Gene databanks. According to the defined controlled vocabulary of phenotype descriptions and phenotype location terms, specific procedures automatically normalize and structure the previously revised annotations included in the latest imported release of OMIM Clinical Synopsis information. When novel Clinical Synopsis annotations are included in a newly imported release, an advisory email is automatically sent to the GFINDer phenotype vocabulary supervisor, who can use specifically designed GFINDer Web interfaces to review and, if required, normalize and structure the new annotations.
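The gene-to-phenotype association through MIM codes can be illustrated with a small in-memory Python sketch. GFINDer itself stores these data in MySQL and uses Java procedures; the dictionaries, the identifiers such as `GENE_A` and `MIM_1`, and the function `annotate_genes` below are hypothetical stand-ins chosen only to show the two-step lookup.

```python
def annotate_genes(gene_ids, gene_to_mim, mim_to_phenotypes):
    """Annotate each gene with the phenotypes of its associated MIM codes.

    Genes without MIM codes, or with MIM codes lacking phenotype
    entries, receive an empty annotation list.
    """
    result = {}
    for gene in gene_ids:
        phenotypes = set()
        for mim in gene_to_mim.get(gene, ()):
            phenotypes.update(mim_to_phenotypes.get(mim, ()))
        result[gene] = sorted(phenotypes)
    return result


# Hypothetical identifiers and associations, for illustration only.
gene_to_mim = {
    "GENE_A": ["MIM_1", "MIM_2"],
    "GENE_B": ["MIM_2"],
    "GENE_C": [],
}
mim_to_phenotypes = {
    "MIM_1": ["Hypertension"],
    "MIM_2": ["Dementia", "Myoclonus"],
}

annotated = annotate_genes(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"], gene_to_mim, mim_to_phenotypes
)
```

In a relational setting the same result would come from joining a gene-to-MIM table with a MIM-to-phenotype table; the dictionaries here only stand in for those tables.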
The approach implemented in GFINDer uses statistical techniques to analyze genetic disorder phenotypes and their locations. Because a gene may or may not be associated with a certain phenotype category defined in the controlled vocabulary, the number of genes and their frequency, distribution, and probability of occurrence are calculated for each phenotype category related to a considered gene set. Several different statistical tests can be used to calculate the probability of having x genes or fewer associated with a given phenotype category. In GFINDer, the hypergeometric test (more time-consuming), the binomial test (which is an asymptotic limit of the hypergeometric test for large numbers of genes), and Fisher's exact test (based on a two-way table crossing gene classes and phenotype categories) were implemented [17, 20, 21]. As usual in significance tests, small p-values indicate phenotype categories that are relevant for a certain class of genes. However, depending on the number of considered genes and their associated phenotype annotations, the number of statistical tests performed can be high. This can greatly increase the Type-I error associated with the tests, i.e. the probability of obtaining a significant p-value by chance when the null hypothesis is true (the false-positive rate, as it is known in the medical field). The calculated p-values must therefore be corrected in order to obtain proper significance levels.
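As a minimal illustration of the hypergeometric and binomial calculations mentioned above, the following Python sketch computes tail probabilities with the standard library only. The parameter names, the function names, and the numbers in the example are hypothetical; the sketch is not GFINDer's implementation, and it provides both tails because the appropriate one depends on whether under- or over-representation is being tested.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(X = k): k of the `draws` genes belong to the `successes` subset."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


def hypergeom_lower_tail(k, population, successes, draws):
    """P(X <= k), i.e. the probability of observing k genes or fewer."""
    upper = min(k, draws)
    return sum(
        hypergeom_pmf(i, population, successes, draws) for i in range(upper + 1)
    )


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k), i.e. the probability of observing k genes or more."""
    upper = min(successes, draws)
    return sum(
        hypergeom_pmf(i, population, successes, draws)
        for i in range(k, upper + 1)
    )


def binomial_upper_tail(k, draws, p):
    """P(X >= k) under the binomial approximation with success probability p."""
    return sum(
        comb(draws, i) * p**i * (1 - p) ** (draws - i)
        for i in range(k, draws + 1)
    )


# Hypothetical example: 20 annotated genes in total, 8 of them in one class,
# 5 genes associated with a given category, 4 of which are in that class.
p_upper = hypergeom_upper_tail(4, population=20, successes=8, draws=5)
p_lower = hypergeom_lower_tail(4, population=20, successes=8, draws=5)
p_binomial = binomial_upper_tail(4, draws=5, p=8 / 20)
```

With so few genes the binomial value differs noticeably from the hypergeometric one, consistent with the binomial test being a limit for large numbers of genes.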
In GFINDer several correction methods for multiple tests have been included. The simplest and strictest is the Bonferroni method, which can be applied if the performed tests are independent. It consists in lowering the significance threshold α applied to each single test so that the Type-I error of the whole set of tests is kept at α. The corrected threshold is α_corrected = α / N, where N is the number of performed tests. From a practical point of view, this is equivalent to keeping the usual threshold α for the performed tests and correcting the observed p-values such that p_corrected = N * p. However, the Bonferroni method greatly reduces the power to detect a specific hypothesis as the number of tests increases. False Discovery Rate (FDR) and Family-Wise Error-rate (FWE), an extension of the Bonferroni method, are milder corrections that are suitable even when independence among tests does not hold. The former consists, briefly, in ordering the N p-values such that the maximum has rank N and the minimum has rank 1. The correction to be applied is then p_corrected = p * N / rank(p), which leaves the maximum p-value unchanged. The latter instead, in the implementation proposed by Benjamini and Hochberg, uses the following p-value correction: p_corrected = p * (N - rank(p) + 1). All three methods illustrated above are available in GFINDer. Among them the FDR, which is the mildest of the three and in practice amounts to fixing the maximum acceptable proportion of false positives among the tests declared significant, is considered the most suitable correction method for genomic data.
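The three corrections can be written down directly from the formulas above. The Python sketch below is a literal transcription under stated assumptions, not GFINDer's code: the function names are hypothetical, corrected values are capped at 1, tied p-values receive distinct consecutive ranks, and the monotonicity adjustment that standard statistical libraries usually add on top of the rank-based formulas is deliberately omitted.

```python
def ranks_ascending(p_values):
    """Rank p-values so that the minimum has rank 1 and the maximum rank N."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    ranks = [0] * len(p_values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def bonferroni(p_values):
    """p_corrected = N * p, capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def fdr(p_values):
    """p_corrected = p * N / rank(p), capped at 1."""
    n = len(p_values)
    ranks = ranks_ascending(p_values)
    return [min(1.0, p * n / r) for p, r in zip(p_values, ranks)]


def fwe(p_values):
    """p_corrected = p * (N - rank(p) + 1), capped at 1."""
    n = len(p_values)
    ranks = ranks_ascending(p_values)
    return [min(1.0, p * (n - r + 1)) for p, r in zip(p_values, ranks)]


# Hypothetical p-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.005]
by_bonferroni = bonferroni(raw)
by_fdr = fdr(raw)
by_fwe = fwe(raw)
```

For any rank between 1 and N, N / rank(p) is at most N - rank(p) + 1, which in turn is at most N, so the FDR-corrected value never exceeds the FWE-corrected one, and neither exceeds the Bonferroni-corrected one.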
- Phillips TJ, Belknap JK: Complex-trait genetics: emergence of multivariate strategies. Nat Rev Neurosci 2002, 3: 478–485.
- Cantor MN, Lussier YA: Mining OMIM for insight into complex diseases. In Proceedings of Medinfo 2004: 7–11 September 2004; San Francisco, CA. Edited by: Fieschi M, Coiera E, Li Y-CJ. Amsterdam, NL: IOS Press; 2004:753–757.
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2005, 33: D54-D58. 10.1093/nar/gki031
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein Knowledgebase. Nucleic Acids Res 2004, 32: D115-D119. 10.1093/nar/gkh131
- Sonnhammer ELL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32: D262-D266. 10.1093/nar/gkh021
- Hristovski D, Peterlin B, Mitchell JA, Humphrey SM: Using literature-based discovery to identify disease candidate genes. Int J Med Inform 2005, 74(2–4):289–298. 10.1016/j.ijmedinf.2004.04.024
- Koike A, Niwa Y, Takagi T: Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics 2005, 21(7):1227–1236. 10.1093/bioinformatics/bti084
- Chen L, Friedman C: Extracting phenotypic information from the literature via natural language processing. In Proceedings of Medinfo 2004: 7–11 September 2004; San Francisco, CA. Edited by: Fieschi M, Coiera E, Li Y-CJ. Amsterdam, NL: IOS Press; 2004:758–762.
- Rindflesch TC, Libbus B, Hristovski D, Aronson AR, Kilicoglu H: Semantic relations asserting the etiology of genetic diseases. In Proceedings of AMIA 2003 Annual Symposium: 8–12 November 2003; Washington, DC. Edited by: Musen MA. Bethesda, MD: Omnipress; 2003:554–558.
- Alma Knowledge Server [http://aks.bioalma.com/]
- Baasiri RA, Glasser SR, Steffen DL, Wheeler DA: The breast cancer gene database: a collaborative information resource. Oncogene 1999, 18(56):7958–7965. 10.1038/sj.onc.1203335
- Steffen DL, Levine AE, Yarus S, Baasiri RA, Wheeler DA: Digital reviews in molecular biology: approaches to structured digital publication. Bioinformatics 2000, 16(7):639–649. 10.1093/bioinformatics/16.7.639
- Becker KG, Barnes KC, Bright TJ, Wang SA: The Genetic Association Database. Nature Genet 2004, 36: 431–432. 10.1038/ng0504-431
- McKusick VA: Mendelian Inheritance in Man. A catalog of human genes and genetic disorders. 12th edition. Baltimore, MD, Johns Hopkins University Press; 1998.
- Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005, 33: D514-D517. 10.1093/nar/gki033
- Masseroli M, Martucci D, Pinciroli F: GFINDer: Genome Function INtegrated Discoverer through dynamic annotation, statistical analysis, and mining. Nucleic Acids Res 2004, 32: W293-W300. 10.1093/nar/gkh108
- BD Biosciences Clontech [http://www.bdbiosciences.com/clontech/]
- Online Mendelian Inheritance in Man (OMIM) FTP site [ftp://ftp.ncbi.nih.gov/repository/OMIM/]
- Casella G, Berger RL: Statistical inference. 2nd edition. Belmont, CA, Duxbury Press; 2002.
- Fisher LD, van Belle G: Biostatistics: a methodology for the health sciences. New York, NY, John Wiley & Sons; 1993.
- Bonferroni CE: Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 1936, 8: 3–62.
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a practical and powerful approach to multiple testing. J R Stat Soc 1995, 57: 289–300.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.