Quantitative prediction of the effect of genetic variation using hidden Markov models
© Liu et al.; licensee BioMed Central Ltd. 2014
Received: 2 September 2013
Accepted: 2 January 2014
Published: 9 January 2014
With the development of sequencing technologies, more and more sequence variants are available for investigation. Different classes of variants in the human genome have been identified, including single nucleotide substitutions, insertion and deletion, and large structural variations such as duplications and deletions. Insertion and deletion (indel) variants comprise a major proportion of human genetic variation. However, little is known about their effects on humans. The absence of understanding is largely due to the lack of both biological data and computational resources.
This paper presents a new indel functional prediction method HMMvar based on HMM profiles, which capture the conservation information in sequences. The results demonstrate that a scoring strategy based on HMM profiles can achieve good performance in identifying deleterious or neutral variants for different data sets, and can predict the protein functional effects of both single and multiple mutations.
This paper proposed a quantitative prediction method, HMMvar, to predict the effect of genetic variation using hidden Markov models. The HMM based pipeline program implementing the method HMMvar is freely available at https://bioinformatics.cs.vt.edu/zhanglab/hmm.
Genomic variability contributes to evolution and population diversity. With the development of high throughput technologies, a massive amount of variation data is available in online public databases, for example, dbSNP , dbVar , Human Gene Mutation Database , Ensembl , and Catalogue of Somatic Mutations in Cancer (COSMIC) . Different types of variation have been identified, such as single nucleotide polymorphisms (SNP), short sequence repeat, insertion/deletion polymorphism (indel), copy number variants (CNV), and inversions. Recent pilot studies from the 1000 Genomes Project Consortium  and the International HapMap Project  revealed that there are about 15 million SNPs, one million short indels, and 20,000 structural variants (SVs) harbored by the studied populations.
Indels, especially frame shifting insertions and deletions, are expected to have large effects on protein functions, since they may change the reading frame of a gene thus change amino acids and probably the functions of proteins. It has been shown that indels cause more severe functional changes in proteins than SNPs  and also have significant influence on protein-protein interaction interfaces . As revealed by the Human Gene Mutation Database , approximately half (57%) of the human (gene sequence level) disease variations are associated with single nucleotide substitutions, and about a quarter (22%) are associated with small indels [3, 10]. Mill et al.  have shown that 42% of the nearly two million indels they identified are mapped to human genes and more than 2,000 indels affect coding exons and likely disrupt protein function and cause phenotypic change in humans. Moreover, they found that many of the identified indels had a high level of linkage disequilibrium (LD) with SNPs, which indicates the indels might be the essential factors that cause diseases. Furthermore, indel variants have profound functional impact in human specific evolution and adaptation [12–14].
With an increasing amount of genomic variability data, computational tools for prediction of the functional impacts of these variants on proteins are needed to help biologists select variants for experimental studies. So far, SNPs have been intensively studied and tools for predicting SNP functional effects have been developed, while little is known about the functional effects of indels, the second most common type of genetic variation in humans.
The protein sequence based prediction methods for functional effects of different types of variants are typically grouped into two classes , constraint based predictor and trained classifier. Previous studies mainly concern SNPs and there are a few dozen computer programs and web servers devoted to predicting the effects of SNP variants. For example, SIFT SNP  is a constraint based predictor and PolyPhen  is a trained classifier, both protein sequence based. There are also many nucleotide sequence based prediction methods using evolutional information, such as GERP , SCONE , etc. In contrast, the efforts devoted to indel effect prediction are limited. Recent indel prediction studies include an evolutionary conservation based approach for both coding and noncoding regions , a trained classifier method for frameshift variants , and another evolutionary conservation based method for multiple types of variation . This paper proposes a profile hidden Markov model (HMM)  based approach HMMvar, which differs from previous approaches in having a formal probabilistic basis.
HMMvar prediction of indels
Dataset from dbSNP
To test the consistency of HMMvar scores with a genome wide analysis, the indels with minor allele frequency (MAF) in dbSNP were extracted, resulting in 447 indels to be scored. The less the allele frequency is in a certain position of a genome, the more conserved the site and the more deleterious the effect of a mutation at this site, in terms of evolutionary theory. In this experiment, the MAF shows a negative Pearson correlation with the HMMvar score (r = –0.03), which is consistent with the common indication of MAF (the lower the MAF, the higher the significance of the site), though the correlation is not significant.
Comparison with other tools
Comparison between HMMvar prediction and SIFT Indel prediction with dbSNP indel dataset
Dataset from ENSEMBL
Benign + Possibly damaging/PolyPhen
Validation on individual proteins TP53
Insertion and deletion variant data, limited to coding regions, was downloaded from dbSNP Build 137 (http://www.ncbi.nlm.nih.gov/projects/SNP/) and grouped into two categories, indels with records of disease association in the Locus-specific Mutation Database (LSDB)  and those without LSDB records. There are 2631 indels with LSDB annotation and 1458 indels without such records (Table 1). The first disease associated indel group is assumed to be more deleterious than the second one. 393 coding SNPs, for which there are either SIFT SNP or PolyPhen prediction records in Ensembl (Table 3), were used for comparison with the current HMMvar scoring method. For the human tumor suppressor protein TP53, a set of 2,565 SNP mutants and corresponding biological activity levels were obtained from the database IARC TP53 . The mutants associated with TP53 were partitioned into four classes: nonfunctional, partially functional, functional (wildtype), and supertrans (higher activity than wildtype) . Transactivity level was measured by eight promoter-specific activity levels and the classification was made in terms of the median of these eight levels. The dataset CFTR was obtained from the Human Gene Mutation Database (HGMD Professional 2012.3); only SNP mutants were included. The CFTR gene mutants have typical phenotypes, such as cystic fibrosis (CF), congenital absence of vas deferens (CAVD), pancreatitis, etc. This work used only the two largest groups CF (732 single point mutants) and CAVD (98 single point mutants) to test the profile HMM prediction method.
According to the theory of natural selection, different regions of a functional sequence are subject to different selective pressures. Multiple sequence alignment reveals this by residual conservation in certain positions. Some positions are more conserved than others, and some regions are more tolerant to insertion and deletion variants than others. Mutants occurring at highly conserved residuals are more likely to be deleterious, whereas mutants occurring at lowly conserved residuals are more likely to be neutral or less deleterious. A profile HMM is a nondeterministic finite state machine consisting of a series of states, each of which corresponds roughly to a position (column) in the multiple sequence alignment from which the HMM was built . Scoring (computing the probability of generation by a given Markov process) a wild type sequence or mutated sequence with the profile HMM gives one an idea how far the given sequence is away from the original population. A profile HMM captures the characteristics of a multiple sequence alignment, from which quantitative conservation information (a probability) is obtained. Thus, a high score of the probability of generation from the profile HMM for the wild type sequence and a low HMMvar score for the mutant sequence probably mean that the mutation has deleterious effect.
The five-step prediction pipeline (Figure 1) receives a set of indels (or other types of variants) as input. The first step identifies all unique proteins associated with these indels as wild type sequences (seeds). Since there may be multiple indels associated with one protein and multiple proteins may be involved with one indel, it is more computationally efficient to first identify all the proteins involved. The mutant sequences for a given wild type sequence are obtained by inserting the indels into the wild type sequence. The second step, using the identified proteins as seeds, invokes PSIblast  on the nonredundant protein sequence (nr) database to find a set of homologous sequences for each seed protein. The e-value and iteration limits were 0.01 and five, respectively. Only homologous sequences with an identity percentage higher than 90% are used in the next step. The third step invokes ClustalW2  with the BLOSUM62 matrix and the word size three for multiple sequence alignment for each homologous sequences set. The next step builds profile hidden Markov models with HMMER3  using the multiple sequence alignments as training data (one HMM per seed protein). All mutant type sequences derived from the same seed sequence will use the same HMM for functional effect prediction. The last step uses all the constructed HMMs for functional predictions. Precisely, given an input indel (mutant type) corresponding to seed protein i (wild type), the ith profile HMM is used to compute the HMMvar score S, as defined below.
where l is the length of the sequence and P1 is set to 350/351 in the architecture of plan 7 null model . From the null model and bit score equation, the probability P w or P m can be derived as P = Pnull *e B given a wild type sequence or mutated type sequence.
Each wild type sequence (or seed protein) corresponds to one HMM model. Theoretically, the wild type sequence bit score could be less than or equal to zero, however, it makes no sense to compare the mutant type sequences with this wild type sequence, because the wild type itself does not match with the HMM model. Consequently, we consider only the HMMs whose wild type sequence bit scores are greater than zero and compute the odds ratio for mutant type sequences that derive from these wild type sequences. The odds ratio is expected to be greater than 1, indicating the wild type sequence is more likely to occur in the HMM presented family. However, in practice, this is not always the case, which indicates that the mutant type sequence better fits the homology set profile. This situation may result from the nucleotide level mutation causing the amino acid level changes to be more compatible  with the homologous sequences than the wild type protein.
If the HMMvar score S is less than a threshold t, the indel is considered as neutral, otherwise deleterious. Fisher’s exact test was used to choose the threshold, using SIFT indel prediction as the reference method, by minimizing the exact test p-value, giving the optimal threshold t = 2.0 for the data sets used.
which is the base 2 logarithm of the relative risk (probability of generating the wild type sequence over the probability of generating the mutant type sequence). This was done for the TP53 and CFTR datasets, but the prediction results using D were not better than for S, and hence are not reported here.
The selection of homologous sequences is key to building a high quality profile HMM. The nonredundant protein sequence (nr) database was used with PSIblast  to collect homologous sequences for each seed protein, using e-value 0.01, and iteration limit five. All sequences above 90% identity were selected as homologous sequences for a certain seed protein. Attempts to improve diversity in the homologous sequence set by including the sequences below 10% identity or using instead all sequences from 60% identity to 95% identity did not produce better HMMvar score distributions. SIFT SNP prediction is used as a reference to determine HMMvar score thresholds of 2.
Most existing methods for variant effect prediction are based on evolutionary conservation theory, which predicts that highly conserved sites experience strong purifying selection and mutations in these sites are most likely to be deleterious to protein function. However, these methods take each site independent of other sites and do not consider the impact of surrounding sites. Moreover, most of these methods are designed only for SNP variants. In contrast, a profile HMM serves as a representation of a set of homologous sequences, relating all sites through a Markov process. Consequently, the present method HMMvar can provide functional predictions for the effects of all types of sequence variations besides SNPs, and can predict the effect of multiple variants simultaneously. The latter is especially useful as when multiple variants occur in a protein, each one of them may have deleterious effects on protein function, but the combination of them may have less harmful effect due to the possibility of compensatory effect. Profile HMMs, used as proposed, have the capability to predict the total effect of multiple mutations along the gene given a specific haplotype.
Factors affecting the prediction of indel effect
It is expected that the quality of multiple sequence alignment is another factor that can potentially affect the prediction of indel effect. Comparing the HMMvar scores based on different multiple sequence alignment programs, ClustalW  and MUSCLE , for the TP53 transitivity level dataset, showed that HMMvar scores based on the MUSCLE sequence alignment decreases more smoothly and shows lower variance within the same functional classes than scores based on the ClustalW sequence alignment. This suggests that having high quality sequence alignment is important for accurate indel effect prediction.
With the dramatic increase of the number of genetic variations discovered in human and other species’ populations, much effort is required in order to fully understand their effect on species. This paper proposed a quantitative prediction method, HMMvar, to predict the effect of genetic variation, both indels and SNPs, using hidden Markov models. Results show that HMMvar can achieve good performance in identifying deleterious or neutral variants for different datasets, and can predict the protein functional effects of both single and multiple mutations.
The work was partially supported by a NIH grant to Zhang.
- Sherry S, Ward M, Kholodov M: dbSNP: the ncbi database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.View ArticlePubMed CentralPubMedGoogle Scholar
- MacDonald JR, et al: The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res. 2013, 42: D986-D992.View ArticlePubMed CentralPubMedGoogle Scholar
- Stenson P, Mort M, Ball E: The human gene mutation database: 2008 update. Genome Med. 2009, 22 (1): 13-View ArticleGoogle Scholar
- Flicek P, Amode M, Barrell D: Ensembl 2012. Nucleic Acids Res. 2012, 40: D84-D90. 10.1093/nar/gkr991.View ArticlePubMed CentralPubMedGoogle Scholar
- Forbes S, Bindal N, Bamford S: Cosmic: mining complete cancer genomes in the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2010, 39: D945-D950.View ArticlePubMed CentralPubMedGoogle Scholar
- 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467: 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar
- The International HapMap 3 Consortium: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467: 52-58. 10.1038/nature09298.View ArticlePubMed CentralGoogle Scholar
- Schönhuth A, et al: Towards improved assessment of functional similarity in large-scale screens: a study on indel length. J Comput Biol. 2010, 17 (1): 1-20. 10.1089/cmb.2009.0031.View ArticlePubMedGoogle Scholar
- Hormozdiari F, et al: The effect of insertions and deletions on wirings in protein-protein interaction networks: a large-scale study. J Comput Biol. 2009, 16 (2): 159-167. 10.1089/cmb.2008.03TT.View ArticlePubMedGoogle Scholar
- Stenson P, Ball E, Mort M: Human gene mutation database (HGMD): 2003 update. Hum Mutat. 2003, 21 (6): 577-581. 10.1002/humu.10212.View ArticlePubMedGoogle Scholar
- Mills R, Pittard W, Mullaney J: Natural genetic variation caused by small insertions and deletions in the human genome. Genome Res. 2011, 21: 830-839. 10.1101/gr.115907.110.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen C, Chuang T, Liao B: Scanning for the signatures of positive selection for human-speci¯c insertions and deletions. Genome Biol Evol. 2009, 1: 415-419.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen C, Chen F, Li W: Human-specific insertions and deletions inferred from mammalian genome sequences. Genome Res. 2007, 17 (1): 16-22.View ArticlePubMed CentralPubMedGoogle Scholar
- Wetterbom A, Sevov M, Cavelier L: Comparative genomic analysis of human and chimpanzee indicates a key rol for indels in primate evolution. J Mol Evol. 2006, 63: 682-690. 10.1007/s00239-006-0045-7.View ArticlePubMedGoogle Scholar
- Cooper G, Shendure J: Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet. 2011, 12: 628-640. 10.1038/nrg3046.View ArticlePubMedGoogle Scholar
- Pauline C, Henikoff S: Predicting Deleterious amino acid substitutions. Genome Res. 2001, 11: 863-874. 10.1101/gr.176601.View ArticleGoogle Scholar
- Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002, 30 (17): 3894-3900. 10.1093/nar/gkf493.View ArticlePubMed CentralPubMedGoogle Scholar
- Cooper G, Stone E, Asimenos G: Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005, 15 (7): 901-913. 10.1101/gr.3577405.View ArticlePubMed CentralPubMedGoogle Scholar
- Asthana S, Roytberg M, Stamatoyannopoulos J: Analysis of sequence conservation at nucleotide resolution. PLOS Comput Biol. 2007, 3: e254-10.1371/journal.pcbi.0030254. doi:10.1371/journal.pcbi.0030254View ArticlePubMed CentralPubMedGoogle Scholar
- Zia A, Moses A: Ranking insertion, deletion and nonsense mutations based on their effect on genetic information. BMC Bioinforma. 2011, 12: 299-10.1186/1471-2105-12-299.View ArticleGoogle Scholar
- Hu J, Pauline C: Predicting the effects of frameshifting indels. Genome Biol. 2012, 13: R9-10.1186/gb-2012-13-2-r9.View ArticlePubMed CentralPubMedGoogle Scholar
- Choi Y, Sims G, Murphy S: Predicting the functional effect of amino acid substitutions and indels. PLoS One. 2012, 7 (10): e46688-10.1371/journal.pone.0046688. doi:10.1371/journal.pone.0046688View ArticlePubMed CentralPubMedGoogle Scholar
- Eddy S: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.View ArticlePubMedGoogle Scholar
- Soussi T, Ishioka C, Claustres M: Locus-specific mutation databases: pitfalls and good practice based on the p53 experience. Nat Rev Cancer. 2006, 6: 83-90. 10.1038/nrc1783.View ArticlePubMedGoogle Scholar
- Kato S, Han S, Liu W: Understanding the function-structure and function-mutation relationships of p53 tumor suppressor protein by high-resolution missense mutation analysis. Proc Natl Acad Sci USA. 2003, 100 (14): 8424-8429. 10.1073/pnas.1431692100.View ArticlePubMed CentralPubMedGoogle Scholar
- Petitjean A, Mathe E, Kato S: Impact of mutant p53 functional properties on TP53 mutation patterns and tumor phenotype: lessons from recent developments in the IARC TP53 database. Hum Mutat. 2007, 28: 622-629. 10.1002/humu.20495.View ArticlePubMedGoogle Scholar
- Stephen F, LM T, Alejandro A: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.View ArticleGoogle Scholar
- Larkin M, Blackshields G, Brown N: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.View ArticlePubMedGoogle Scholar
- Finn R, Clements J, Eddy S: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39 (2): W29-W37.View ArticlePubMed CentralPubMedGoogle Scholar
- Bairoch A, Apweiler R: The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucl Acids Res. 1997, 25 (1): 31-36. 10.1093/nar/25.1.31.View ArticlePubMed CentralPubMedGoogle Scholar
- Barrett C, Hughey R, Karplus K: Scoring hidden Markov models. Comput Applic Biosci. 1997, 13: 191-199.Google Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.View ArticlePubMed CentralPubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.