The functional importance of disease-associated mutation
© Mooney and Klein 2002
Received: 22 July 2002
Accepted: 9 September 2002
Published: 9 September 2002
Skip to main content
© Mooney and Klein 2002
Received: 22 July 2002
Accepted: 9 September 2002
Published: 9 September 2002
For many years, scientists believed that point mutations in genes are the genetic switches for somatic and inherited diseases such as cystic fibrosis, phenylketonuria and cancer. Some of these mutations likely alter a protein's function in a manner that is deleterious, and they should occur in functionally important regions of the protein products of genes. Here we show that disease-associated mutations occur in regions of genes that are conserved, and can identify likely disease-causing mutations.
To show this, we have determined conservation patterns for 6185 non-synonymous and heritable disease-associated mutations in 231 genes. We define a parameter, the conservation ratio, as the ratio of average negative entropy of analyzable positions with reported mutations to that of every analyzable position in the gene sequence. We found that 84.0% of the 231 genes have conservation ratios less than one. 139 genes had eleven or more analyzable mutations and 88.0% of those had conservation ratios less than one.
These results indicate that phylogenetic information is a powerful tool for the study of disease-associated mutations. Our alignments and analysis has been made available as part of the database at http://cancer.stanford.edu/mut-paper/. Within this dataset, each position is annotated with the analysis, so the most likely disease-causing mutations can be identified.
For many years, scientists believed that point mutations in genes are the genetic switches for somatic and inherited diseases such as cystic fibrosis, phenylketonuria and cancer. For this to be the case, disease-associated amino acid substitutions should occur in functionally important regions of the protein products of genes. While it has been shown in specific cases that disease-associated amino acid substitutions affect protein function, until now few studies have examined this across many genes. Here we provide direct evidence that disease-associated point mutations occur in functionally important regions of the genome and are not distributed equally across the coding regions of genes. This work supports recent efforts to collect disease associated mutational data in databases and suggests that many of the mutations represented in those databases are the likely underlying molecular cause of disease.
Recently there have been a number of commercial and public projects aimed at collecting and understanding human genomic variation . The goal of these projects is to provide an understanding of how genotype is associated with disease, how it affects our response to drugs and how it affects the protein products of genes. Examples of these projects include the SNP Consortium, the Human Genome Mutation Database , many gene specific databases ([3, 4], for example), and both public and private genome sequencing efforts . Much of the data that is being collected are mutations annotated with their observed phenotype. Automated annotation methods based on structural and evolutionary parameters can lead to insight into the molecular basis of disease.
With more than 4,000,000 identified variations and with over 20,000 of them annotated with a phenotype, we are facing the problem of having many uncharacterized mutations. Algorithms are needed for automatically annotating these gene variations to gain insight into how they affect the gene's regulation and/or function of its protein products. Using many collection technologies, uncharacterized SNP data is being placed in public databases such as the Human Genome Mutation Database (over 20,000 entries)  and the National Cancer Institute's CGAP-GAI (Cancer Genome Anatomy Project Genetic Annotation Initiative) . The CGAP-GAI group has identified 10,243 SNPs by examining publicly available EST (Expressed Sequence Tag) chromatograms.
Software for analyzing unannotated SNPs in known disease associated genes will be especially useful when previously unobserved mutations are discovered. Every human has genotypic differences from the standard genome approximately every thousand base pairs [7–9]. Given knowledge of how a genotype differs from the standard, it is important to be able to predict which of the variations are likely to be the cause of disease or other phenotypic differences. Evolutionary information about regulatory and coding regions of genes can be used to highlight certain mutations or groups of mutations that are attributable to a phenotype [10–12].
Early tools using phylogenetic and structural information have shown promise in predicting the functional consequences of a mutation . These reports predict that anywhere between 20–36% of non-synonymous SNPs alter the function of a gene's protein product. In the report by Chasman and Adams, evolutionary information was predicted to be a useful component in determining whether a mutation is deleterious [13, 14]. Disease causing mutations are also likely structurally perturbing at the protein level . Ng and Henikoff have introduced SIFT, a method for predicting functional SNPs from a database of unannotated polymorphisms [16, 17].
The relationship between disease-associated mutation positions and evolutionary conservation has been reported in specific cases. An analysis of the breast and ovarian cancer susceptibility gene, BRCA1, showed that disease-associated mutations tend to occur in highly conserved regions . An analysis of homologous sequences in the androgen receptor has shown similar results . Keratin 12, KRT12, is associated with Meesmann Corneal Epithelial Dystrophy (MCD). Reported mutations often occur in the highly conserved alpha-helix-initiation motif of rod domain 1A or in the alpha-helix-termination motif of rod domain 2B . Structure based analysis methods have also been used to analyze Osteogenesis imperfecta associated COL1A1 mutations and disease-associated P53 mutations (Mooney and Klein, unpublished),. Miller and Kumar have reported that disease-associated mutations are conserved in seven model genes .
To determine the degree to which mutation positions differ evolutionarily from other positions, we have built alignments of homologous genes for 231 disease-associated genes. These multiple alignments have then been used to assess the difference in evolutionary conservation for positions that are both disease-associated and not associated. The results show that, in general, positions with disease-associated mutations are conserved more than the average position in the alignment. This suggests the most conserved mutations are likely to be the causative agents of disease, and our data set identifies these mutations.
Genes used in analysis with conservation ratio. 231 total genes were analyzed
Genes with average conservation ratios of zero
Pituitary Hormone Deficiency
Omenn Syndrome, Immunodeficiency
To collect the mutation data, 231 genes were used for the analysis. They were chosen because they had a reported cDNA sequence, disease-associated mutations and homologs in SWISSPROT. These genes are listed in Table 1. Each alignment consists of all the homologs in SWISSPROT as determined by a BLAST search with an e-value threshold of 10e-15. For each alignment the negative entropy for each column was calculated.
The conservation ratio parameter is defined as the average negative entropy of analyzable positions with reported mutations divided by the average negative entropy of every analyzable position in the gene sequence. Analysis was performed on 231 genes and 6185 mutations and of those we found that 84.0% had conservation ratios less than one. From those, 139 genes had more than ten analyzable mutations and, of those, 88.0% had conservation ratios less than one.
Use of evolutionary information is a promising approach to automated characterization of mutations. These results show that although conservation alone is not a perfect predictive measure, there is useful information contained in sequence alignments containing homologous genes. Approaches using conservation in a multiple alignment should work better when associated with other methods such as structural analysis, population analysis and experimental data. Knowledge of how the sequence pool clusters into families may increase the sensitivity of the method.
Our measured parameter, the conservation ratio, is a quantity that measures the usefulness of a multiple alignment for characterizing mutations in a gene sequence. Knowledge of more mutations in a gene does not necessarily lower the conservation ratio. We expect that knowledge of more mutations in a gene will increase the statistical significance of the conservation ratio. This is the likely underlying cause of the result showing that genes with 10 or mutations increases are more likely to have a conservation ratio less than one.
The alignments and BLAST searches are integrated on the website, http://cancer.stanford.edu/mut-paper/.
In conclusion, there are estimated to be 30,000 non-synonymous differences between an individual and the draft genome [23, 5, 7–9]. Determination of which positions are likely to be disease associated is a challenging and important problem. The finding that disease-associated mutations occur in positions of functional importance supports recent efforts for the building of methods to predict which positions are likely to be disease associated [14, 13, 16–17]. These methods are likely to incorporate protein structure, the amino acid identity of the mutation and phylogenetic information. In an interesting twist, this observation also suggests that this data may be useable as a functional genomics tool for understanding the function of the protein products of genes on a molecular level. Such a method would use the inherent functional information contained in a phenotypically annotated polymorphism to infer functional importance within a gene.
Non-synonymous mutations were acquired from the Human Genome Mutation Database http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html. 231 genes were chosen with known disease-association each having SWISSPROT homologs, a cDNA sequence and mutations in the coding region. For each of those genes, all known non-synonymous mutations were then downloaded with the cDNA sequence for that gene.
Each cDNA sequence was then translated and placed in a FASTA formatted file. For each of the resultant files a BLAST  search was performed against the SwissProt database. All sequences from the returned hits were then stored in FASTA format files. For each of the genes that returned BLAST results with e-value scores smaller than 1e-15, ClustalW  was used to build a sequence alignment.
For each amino acid in the position of interest, the negative entropy was determined using the following formula :
Where the Pi are the probabilities of finding a particular amino acid at that position. For this analysis, gapped positions, "-", were considered independent amino acids.
For each known mutation, the negative entropy of the column it occupies was tabulated. The average negative entropy for each mutation within a gene was compared to the average entropy of all columns satisfying the criteria for analysis. Mutations outside of the coding region or mutations encoding termination codons were discarded.
The list of genes was then sorted by average negative entropy of the mutations. We then calculated the conservation entropy, CE, using:
CE = average NE of mutation positions/average NE of all positions in the gene sequence
The authors thank Irene Liu and Professor Russ Altman at Stanford University for their helpful suggestions and UCSF Computer Graphics Laboratory for use of resources. Funding provided by NIH Grant AR47720-01 (Teri E. Klein, PI) and Nitt Grant LM05652 (Russ B. Altman, PI).
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.