Global inference of disease-causing single nucleotide variants from exome sequencing data
© The Author(s). 2016
Published: 23 December 2016
Whole exome sequencing (WES) has recently emerged as an effective approach for identifying genetic variants underlying human diseases. However, considerable time and labour are needed for careful investigation of candidate variants. Although filtration based on population frequencies and functional prediction scores can effectively remove common and neutral variants, hundreds or even thousands of rare deleterious variants still remain. In addition, current WES platforms also provide variant information in flanking noncoding regions, such as promoters, introns and splice sites. Despite being recognized to harbour causal variants, these regions are usually ignored by current analysis pipelines.
We present a novel computational method, called Glints, to overcome the above limitations. Glints is capable of identifying disease-causing SNVs in both coding and flanking noncoding regions from exome sequencing data. The principle behind Glints is that disease-causing variants should manifest their effect at both the variant and gene levels. Specifically, Glints integrates 14 types of functional scores, including predictions for both coding and noncoding variants, and 9 types of association scores, which help identify disease-relevant genes. We conducted large-scale simulation studies based on 1000 Genomes Project data and demonstrated the effectiveness of our method in both coding and flanking noncoding regions. We also applied Glints to two real exome sequencing datasets and demonstrated its effectiveness for uncovering disease-causing SNVs. Both standalone software and a web server are available at our website http://bioinfo.au.tsinghua.edu.cn/jianglab/glints.
Glints is effective for uncovering disease-causing SNVs in coding and flanking noncoding regions, which is supported by both simulation and real case studies. Glints is expected to be a useful tool for human genetics research based on exome sequencing data.
Technical advances in whole exome sequencing (WES) have enabled the rapid and cost-efficient detection of variants in or near exonic regions, promoting the identification of causative variants underlying Mendelian diseases, complex disorders, and cancers. Nevertheless, computational analysis of WES data remains a great challenge, because the number of distinct variants in a study usually increases dramatically with the size of the disease cohort, and a significant proportion of these variants occur at extremely low frequency. In addition, WES can also discover a non-negligible fraction of variants occurring in flanking regions of captured exons [5, 6]. These regions, which include promoters, splice sites and introns, have been recognized to harbor causal variants but are typically ignored by current analysis pipelines.
To address the first challenge, two groups of computational methods have been developed for prioritizing candidate variants from WES data. The first group, which targets nonsynonymous single nucleotide variants (nsSNVs) and is represented by SIFT and PolyPhen2, utilizes sequence, biochemical and evolutionary information of amino acids to predict functionally damaging effects of variants. Some methods, like Condel, integrate multiple functional predictions to provide more accurate predictions. However, such prediction scores, though reported to achieve high accuracy on public data sets such as HGMD, Swiss-Prot and ClinVar, usually have high false positive rates and low explanatory power in real experimental studies [14, 15]. To overcome this limitation, the second group of methods, represented by eXtasy, SPRING and snvForest, integrates multiple functional predictions of variants, association information between genes and diseases, and phenotype information to prioritize candidate variants. There also exist several methods that integrate variant functional predictions and disease-gene association to prioritize disease genes, such as PHIVE and Phen-Gen. The difference between variant prioritization and gene prioritization is significant: the former incorporates disease-gene association into variants, while the latter aggregates variant functions into genes. However, these methods, though capable of eliminating false positives, usually rely heavily on prior knowledge about the disease under investigation to make inferences. For example, SPRING takes as input a set of seed genes known to be associated with the disease of interest. When the genetic basis of a query disease has never been investigated, genes associated with the 10 diseases of highest phenotype similarity to the query disease are used as seeds.
This strategy, though proven valid, can hardly be optimal, since the association information between genes and all other diseases is ignored. In other words, this strategy is local in nature, because only diseases with very high phenotype similarity to the query disease contribute to the inference procedure.
As for the second challenge, predicting the functionally damaging effects of noncoding variants is much more difficult than for coding variants. Unlike variants in coding regions, noncoding variants affect biological functions through complex mechanisms such as epigenetic regulation. Fortunately, with recent developments in epigenomics and the release of large-scale projects such as ENCODE and Roadmap Epigenomics that aim at dissecting regulatory elements, the prediction of functional effects of noncoding variants has now become feasible, leading to methods such as CADD, FunSeq, GWAVA, DeepSEA and deltaSVM. Nevertheless, to the best of our knowledge, there is still no computational method capable of predicting causative noncoding variants for a specific type of disease.
To overcome the above limitations, we propose a novel computational method, called Glints, to prioritize both coding and flanking noncoding variants in a disease-specific manner by integrating 14 types of functional scores and 9 types of association scores. We extracted functional scores for SNVs from dbWGFP, a repository collecting whole genome SNVs and their functional predictions, and devised a multivariate regression model to quantify association scores between candidate genes and diseases of interest. After converting both functional scores and association scores into p-values, we integrated them with Fisher’s combined probability test. We conducted large-scale simulation studies based on 1000 Genomes Project Phase I data and demonstrated the effectiveness of our method for identifying causal variants in both coding and flanking noncoding regions. We further compared our method with several existing methods for prioritizing coding nsSNVs, and demonstrated the superior performance of our method. We applied our method to two real exome sequencing datasets, and found that Glints could uncover known causal variants and discover new variants with high causality probabilities. Thus, Glints is expected to contribute to human genetics studies based on exome sequencing, and to facilitate our understanding of human diseases.
Overview of Glints
Contributions of our method
Compared with other methods, including our previous work, Glints makes three main contributions: 1) Glints introduces a multivariate regression method for inferring disease-gene association, in which three types of phenotypic similarities (UMLS, HPO, MeSH) are integrated. In contrast, our previous work pgFusion utilized only one type of phenotypic similarity (UMLS) with a univariate regression method. 2) Glints incorporates the disease-gene association obtained by multivariate regression into variant prioritization, and utilizes global network information. Previous methods, such as eXtasy, SPRING, PHIVE and snvForest, only utilize disease-gene information locally, for example through genes associated with several similar diseases, also called “seed genes”. Thus, these methods could be ineffective for diseases whose similar diseases have no associated genes or for which “seed genes” are not available. 3) Glints incorporates variant functional predictions and disease-gene association to prioritize flanking noncoding variants in a disease-specific manner, which remains unexplored to the best of our knowledge.
Summary statistics for data used in simulated experiment across different regions
The prioritization performance of Glints and individual scores on 1000 Genomes Project based simulated data
For each of the four different regions, we separately spiked each causal variant from the region into the pool of corresponding control SNVs of each individual from 1000 Genomes Project to simulate real sequencing data. We then prioritized the mixed SNVs using Glints and observed the relative positions of causal SNVs in the final ranking list. In order to eliminate bias and possible information leakage, we removed all known genes associated with the query disease to mimic the scenario under which the genetic basis for the query disease was totally unknown. To evaluate the performance of our method quantitatively, we defined the rank ratio of a test variant as the rank of the variant divided by the number of neutral variants and the mean rank ratio (MRR) of a disease as the average rank ratio of causal variants corresponding to that disease. We then took the average MRR for these diseases as a metric to assess overall performance. Since a smaller MRR means that disease-causing variants are enriched in the top-ranking positions, better performance is indicated with smaller MRR. We could also obtain both false-positive and true-positive rates by defining a threshold for rank ratio and easily compute the area under ROC curve (AUC) through varying the threshold. Another important indicator is the number of causal variants which are ranked in the top 10 (TOP), and a higher TOP number means higher performance.
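The ranking metrics defined above can be illustrated with a minimal sketch. It assumes each candidate carries a combined p-value where smaller means more likely causal; the function names and the pessimistic handling of ties are assumptions made for illustration, not details taken from the Glints software.

```python
from typing import Sequence


def rank_of_causal(causal_p: float, neutral_p: Sequence[float]) -> int:
    """1-based rank of the spiked-in causal variant among the mixed SNVs.

    Smaller p-values rank higher. Ties are counted against the causal
    variant (an assumption; the text does not specify tie handling).
    """
    return 1 + sum(1 for p in neutral_p if p <= causal_p)


def rank_ratio(causal_p: float, neutral_p: Sequence[float]) -> float:
    """Rank of the causal variant divided by the number of neutral variants."""
    return rank_of_causal(causal_p, neutral_p) / len(neutral_p)


def mean_rank_ratio(ratios: Sequence[float]) -> float:
    """MRR: average rank ratio over the causal variants of one disease."""
    return sum(ratios) / len(ratios)


def top_count(ranks: Sequence[int], k: int = 10) -> int:
    """TOP: number of causal variants ranked within the first k positions."""
    return sum(1 for r in ranks if r <= k)


if __name__ == "__main__":
    neutral = [0.5, 0.2, 0.9, 0.04]      # toy control SNVs of one individual
    print(rank_of_causal(0.1, neutral))  # one neutral SNV outranks the causal one
    print(rank_ratio(0.1, neutral))
    print(top_count([1, 2, 11, 38]))
```

A smaller MRR and a larger TOP both indicate that causal variants concentrate at the head of the list, which is why the two metrics are reported together.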
We then evaluated the ability of Glints to prioritize promoter SNVs. To accomplish this, we performed the 1000 Genomes Project-based simulation studies, as noted above. Since each individual harbors 19,055 promoter SNVs on average (Fig. 2(b)), pinpointing causal promoter SNVs from this large pool is challenging. Nonetheless, Glints achieved a satisfactory result, with about 82 of 114 causal promoter SNVs ranked in the top 10, significantly more than expected by chance. The average MRR is 3.63%, and the corresponding AUC is 96.20%, both better than expected by chance. We also observed significant differences in performance across populations (p-values < 2 × 10^-16 for TOP, MRR and AUC), but, except for TOP, these differences cannot be attributed to the varying number of candidates across populations (Additional file 1: Figure S1).
We next evaluated the ability of Glints to prioritize intronic SNVs by the same method described above. As shown in Fig. 2(c), each individual carries about 2532 intronic variants on average when considering only the inner 3-10 bp, with the exception of the three African populations, which carry more (p-value < 2 × 10^-16). Glints ranked about 219 out of 303 causal variants in the top 10, which was significantly better than expected by chance. The corresponding MRR and AUC were 3.65% and 96.13%, respectively, on average, suggesting the effectiveness of our method.
Finally, Glints was evaluated for its ability to prioritize splice site SNVs, again using the same method as described above. As shown in Fig. 2(d), each individual has 80 splice site variants on average. Glints ranked 1047 out of 1105 causal splice site variants in the top 10. We observed a greater proportion of splice site variants receiving top rankings as a result of the smaller number of candidates in this region. The corresponding MRR and AUC were 4.06% and 96.43%, respectively, on average, again suggesting the effectiveness of our method.
Comparison with existing methods
Contribution of individual scores
We included 14 types of functional prediction scores in Glints, and these scores differed in several aspects, such as principles, training data, learning algorithms and applicability (Additional file 1: Table S2). Even for the same score, differences may exist when it is applied in different regions. We first evaluated the correlations between these functional prediction scores; scores with similar underlying learning procedures are expected to show high correlations and tend to cluster together. For each of the four regions, we selected the corresponding causal variants used in the aforementioned simulation studies and their functional scores. We then computed Pearson’s correlation coefficients between each pair of functional scores. SIFT and LRT were transformed to 1-SIFT and 1-LRT, respectively, to keep the direction of deleteriousness consistent, i.e. higher values indicating higher deleteriousness.
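The pairwise correlation step can be sketched as follows. This is only an illustration under stated assumptions: scores are held in a plain dictionary keyed by score name, SIFT and LRT lie in [0, 1] so that 1 - score reverses their direction, and the helper names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

# Scores for which smaller values mean more deleterious (from the text).
REVERSED_SCORES = {"SIFT", "LRT"}


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson's correlation coefficient of two equally long vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def pairwise_correlations(
    scores: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlate every pair of scores after aligning their direction."""
    aligned = {
        name: [1.0 - v for v in values] if name in REVERSED_SCORES else values
        for name, values in scores.items()
    }
    return {
        (a, b): pearson(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }


if __name__ == "__main__":
    toy = {"SIFT": [0.0, 0.5, 1.0], "CADD": [30.0, 15.0, 0.0]}
    print(pairwise_correlations(toy))
```

Without the direction flip, a pair such as SIFT and CADD would appear negatively correlated even when the two scores agree on which variants are damaging.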
We assessed the performance of each individual score in each region by repeating the same simulation studies with that score only, as shown in Table 2. Several scores, including LRT, MSRV, SIFT, SinBaD, MutationAssessor and PolyPhen2, were only available for exonic nsSNVs; therefore, we did not assess their performance in regions other than exons. Gene scores were not restricted to single variants and were thus available for all regions. Table 2 clearly shows the advantage of integrating multiple data sources, which resulted in better performance than any individual score. For example, in the exonic region, MRRs of individual scores ranged from 5.47 to 36.37%, while the corresponding TOPs ranged from 0 to 3086. In contrast, with integration, Glints achieved an MRR of 2.12% and a TOP of 4736. In the promoter region, MRRs of individual scores ranged from 6.31 to 30.62%, while the corresponding TOPs ranged from 0 to 57. With integration, Glints achieved an MRR of 3.63% and a TOP of 82. In the intronic region, MRRs of individual scores ranged from 10.09 to 47.26%, while the corresponding TOPs ranged from 0 to 143. With integration, Glints achieved an MRR of 3.65% and a TOP of 219. Finally, in the splice site region, MRRs of individual scores ranged from 9.03 to 35.16%, while the corresponding TOPs ranged from 218 to 922. With integration, Glints achieved an MRR of 4.06% and a TOP of 1047. We also removed CADD, DANN, FATHMM-MKL and Eigen from Glints because they use 1000 Genomes Project data for training, and the resulting Glints still showed better performance than any individual score (Table 2).
Application on real sequencing data
To assess the effectiveness of Glints on real sequencing data, we collected two recently published exome sequencing datasets and applied Glints to them. We only assessed the performance of Glints on coding variants because data for flanking noncoding variants were difficult to access. The first case was a study on epileptic encephalopathies (MIM: 615369), a heterogeneous and deleterious group of childhood epilepsy disorders with syndromes associated with severe cognitive and behavioral disturbances. In this study, 264 probands and their parents were recruited for exome sequencing, and strong statistical evidence for the association between de novo mutations and this disorder was found. In total, we collected 192 candidate nonsynonymous de novo mutations from this study and applied Glints to prioritize them with the objective of identifying functional mutations. Of those candidates, 30 mutations were reported to show clear statistical evidence in the original literature; therefore, those mutations were considered functional. Following the same procedure as in the simulation experiments described above, we removed all known genes associated with this disorder and all genes overlapping those candidate mutations to prevent possible information leakage. In the prioritized list, 23 out of the top 25 were functional, highlighting the capability of our method for this case. A one-sided Fisher’s exact test suggests that the probability of ranking 23 functional mutations among the top 25 by chance is only 2.14 × 10^-14, further supporting the ability of our method to enrich functional mutations in top positions. Among those top 25 mutations, two were not reported as functional in the original study. One was in the gene GABRB1, which was recently shown to be associated with thalamus volume and intelligence. The other was in the gene GNAO1, which was also recently reported to play a significant role in epileptic encephalopathy [60, 61].
In comparison, SPRING ranked 17 functional SNVs among the top 25, and eXtasy and snvForest each also ranked 17.
The second case was a study on autism spectrum disorders (ASD, MIM: 209850), in which Neale et al. sequenced the exomes of 175 ASD cases and their parents. Through statistical modeling of de novo mutations in the cohort, several genes were identified with strong evidence as key factors involved in ASD. From this study, we collected a total of 104 synonymous SNVs as candidates, among which five were reported as likely functional in the original study. We applied Glints to this list of candidates using a strategy similar to that described above in order to eliminate possible information leakage. In the final prioritized list, the five functional SNVs received ranks of 1, 2, 9, 11 and 38. A one-sided Fisher’s exact test suggests that the probability of ranking 3 functional SNVs in the top 10 by chance is only 0.022. In comparison, SPRING gave these functional variants ranks of 2, 3, 7, 11 and 52, while eXtasy gave ranks of 5, 8 and 17 and filtered out the other two functional variants. snvForest gave ranks of 4, 9, 10, 11 and 70 to these functional variants. These two real cases both indicated better performance of Glints than the other three methods on real exome sequencing data.
It is also worth noting that several aspects of our method can be improved in the future. First, our method is restricted to flanking noncoding regions that lie near genes. Intergenic regions are not suitable for our method, since it is hard to assign genes to variants located in these regions. It is technically feasible to apply our method to deeper intronic regions, but we cannot evaluate its performance without available data for these regions. With the accumulation of variation data and advances in assigning genes to intergenic variants, our method can be extended to handle these regions. Second, our method is restricted to single nucleotide variants, but several other kinds of variants, such as indels and structural variations, are also important for human diseases. Extending our methodology to other forms of variation is one direction for future research. In addition, our integration method can also inspire methodological developments for the integration of other types of biological data. The volume of genomic and genetic data keeps growing, but integrating such bulky data to distill meaningful biological insights is far from trivial. By analogy with our method, each type of genomic data can be converted into p-values and then integrated with a weighted Fisher’s method. The combined p-values represent collective evidence from a variety of data, and can effectively reduce false positives compared with a single type of data.
One major challenge in developing computational methods for identifying causative variants is the scarcity of public real sequencing data. After surveying hundreds of publications, we found only two exome sequencing datasets, and no real data for noncoding variants were available. In the future, an increasing number of public sequencing datasets will benefit methodological development.
In this study, we present a novel computational method, called Glints, to prioritize both coding and flanking noncoding SNVs with respect to the query disease in exome sequencing studies. It can also be useful in whole genome sequencing studies if only coding and flanking noncoding variants are of interest. Our method integrates 14 types of functional prediction scores for variants, including predictions for both coding and noncoding regions, and 9 types of association scores, which quantify the association between genes hosting candidate variants and diseases of interest. Based on large-scale simulation studies, we conclude that our method has satisfactory performance and competitive accuracy compared with existing methods. It is expected that Glints can serve as a useful tool in human genetics studies based on exome sequencing, saving time and cost in follow-up experimental studies and facilitating the discovery of disease-causing variations.
Multivariate linear regression for association inference
Calibration of p-values
We used 14 types of functional prediction scores at the variant level and 9 types of association scores at the gene level. These scores are heterogeneous in scale and interpretation, which makes it difficult to integrate them directly. Therefore, before integration, we converted all of them into p-values, which can be integrated with Fisher’s method.
For each type of variant-level score, we first sorted all available scores stored in a database (e.g. dbWGFP) and built an empirical null distribution after excluding known causal variants. Although some unknown causal variants may still exist, their impact on the estimation of the empirical null distribution is negligible, due to the low odds of causal to neutral variants, which results from natural selection. Then, we compared a query score with the corresponding empirical null distribution and calculated the proportion of more extreme scores as the empirical p-value. For SIFT and LRT, smaller scores indicate higher deleteriousness, and hence “more extreme” means smaller than the query score. For all the other scores, larger scores indicate higher deleteriousness, and “more extreme” means greater than the query score.
For gene-level scores, it is also necessary to calculate empirical p-values. Although we can obtain analytical p-values from regression analysis as detailed above, those p-values can be biased when the underlying assumptions are violated. We therefore built an empirical null distribution for each type of association score from the corresponding analytical p-values of neutral genes, i.e. genes not reported as causative for any disease. Then, for each gene-level analytical p-value, we compared it with the corresponding empirical null distribution and calculated the proportion of more extreme values as the empirical p-value. Here “more extreme” means smaller than the analytical p-value of the query gene.
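The calibration described in this section reduces to counting how many null values are more extreme than the query. The following is a minimal sketch, assuming the null values (scores of presumed neutral variants, or analytical p-values of neutral genes) have been sorted in ascending order beforehand; the function name and the boolean flag are illustrative assumptions.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def empirical_p(
    query: float,
    null_sorted: Sequence[float],
    larger_is_more_extreme: bool = True,
) -> float:
    """Proportion of null values strictly more extreme than the query.

    null_sorted must be in ascending order. Set larger_is_more_extreme
    to False for SIFT, LRT and gene-level analytical p-values, where
    smaller values are the more extreme ones.
    """
    n = len(null_sorted)
    if larger_is_more_extreme:
        more_extreme = n - bisect_right(null_sorted, query)  # values > query
    else:
        more_extreme = bisect_left(null_sorted, query)       # values < query
    # Note: the result can be exactly 0 when nothing in the null is more
    # extreme; such values need a lower bound before taking logarithms.
    return more_extreme / n


if __name__ == "__main__":
    null = sorted([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    print(empirical_p(0.85, null))                                 # variant score
    print(empirical_p(0.15, null, larger_is_more_extreme=False))  # SIFT-like
```

Sorting the null distribution once and using binary search keeps each lookup logarithmic in the size of the null, which matters when the null is built from a genome-wide database.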
Fisher’s method with dependence correction
We also calculated q-values for the combined p-values to control the positive false discovery rate (pFDR), an approach that has shown significantly improved power in some studies compared with the traditional Benjamini-Hochberg approach. Our method also handles missing data sources easily: when a score is unavailable, we simply reduce the number of p-values to be combined.
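The combination step with missing data sources can be sketched as below. This sketch implements the standard Fisher’s combined probability test, which assumes independent p-values; it does not reproduce any dependence correction. Representing a missing source as None and bounding p-values from below with `floor` are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def fisher_combine(
    p_values: Sequence[Optional[float]], floor: float = 1e-300
) -> Optional[float]:
    """Combine p-values with Fisher's method, skipping missing entries.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k is the number of
    available p-values. For even degrees of freedom the survival function
    has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    observed = [max(p, floor) for p in p_values if p is not None]
    k = len(observed)
    if k == 0:
        return None  # nothing to combine
    half_stat = -sum(math.log(p) for p in observed)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # One of three data sources is missing; only two p-values are combined.
    print(fisher_combine([0.05, None, 0.05]))
```

Dropping a missing entry lowers the degrees of freedom by two, so variants with fewer available scores are not penalized for the absent evidence.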
We thank Dr. Jiaxin Wu for useful discussions.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 17, 2016: Proceedings of the 27th International Conference on Genome Informatics: bioinformatics. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-17.
This research was partially supported by the National Basic Research Program of China (2012CB316504), the National High Technology Research and Development Program of China (2012AA020401), the National Natural Science Foundation of China (61573207, 61175002), The Recruitment Program of Global Experts of China, and Tsinghua National Laboratory for Information Science and Technology. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Funding for the publication costs has been provided by the National Basic Research Program of China (2012CB316504).
Availability of data and materials
Glints software and webserver are available from http://bioinfo.au.tsinghua.edu.cn/jianglab/glints.
RJ and TC designed this project, MMW wrote the programs and produced the main results, and RJ and MMW wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Bamshad MJ, Ng SB, Bigham AW, Tabor HK, Emond MJ, Nickerson DA, Shendure J. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev Genet. 2011;12(11):745–55.
- Kiezun A, Garimella K, Do R, Stitziel NO, Neale BM, McLaren PJ, Gupta N, Sklar P, Sullivan PF, Moran JL. Exome sequencing and the genetic basis of complex traits. Nat Genet. 2012;44(6):623–30.
- Wang K, Kan J, Yuen ST, Shi ST, Chu KM, Law S, Chan TL, Kan Z, Chan AS, Tsui WY. Exome sequencing identifies frequent mutation of ARID1A in molecular subtypes of gastric cancer. Nat Genet. 2011;43(12):1219–23.
- Cooper GM, Shendure J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet. 2011;12(9):628–40.
- Tian G, Wang J, Wang J, Yang H, Zhang X. Comprehensive comparison of three commercial human whole-exome capture platforms. 2011.
- Meienberg J, Zerjavic K, Keller I, Okoniewski M, Patrignani A, Ludin K, Xu Z, Steinmann B, Carrel T, Röthlisberger B. New insights into the performance of human whole-exome capture platforms. Nucleic Acids Res. 2015;43(11):e76. doi:10.1093/nar/gkv216.
- Vinagre J, Almeida A, Pópulo H, Batista R, Lyra J, Pinto V, Coelho R, Celestino R, Prazeres H, Lima L. Frequency of TERT promoter mutations in human cancers. Nat Commun. 2013;4:2185.
- Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7):1073–81.
- Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248–9.
- González-Pérez A, López-Bigas N. Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am J Hum Genet. 2011;88(4):440–9.
- Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, Thomas NS, Abeysinghe S, Krawczak M, Cooper DN. Human gene mutation database (HGMD®): 2003 update. Hum Mutat. 2003;21(6):577–81.
- Boeckmann B, Bairoch A, Apweiler R, Blatter M-C, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31(1):365–70.
- Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014;42(D1):D980–5.
- Miosge LA, Field MA, Sontani Y, Cho V, Johnson S, Palkova A, Balakishnan B, Liang R, Zhang Y, Lyon S. Comparison of predicted and actual consequences of missense mutations. Proc Natl Acad Sci. 2015;112(37):E5189–98.
- Itan Y, Casanova J-L. Can the impact of human genetic variations be predicted? Proc Natl Acad Sci. 2015;112(37):11426–7.
- Sifrim A, Popovic D, Tranchevent L-C, Ardeshirdavani A, Sakai R, Konings P, Vermeesch JR, Aerts J, De Moor B, Moreau Y. eXtasy: variant prioritization by genomic data fusion. Nat Methods. 2013;10(11):1083–4.
- Wu J, Li Y, Jiang R. Integrating multiple genomic data to predict disease-causing nonsynonymous single nucleotide variants in exome sequencing studies. PLoS Genet. 2014;10(3):e1004237.
- Wu M, Wu J, Chen T, Jiang R. Prioritization of nonsynonymous single nucleotide variants for exome sequencing studies via integrative learning on multiple genomic data. Sci Rep. 2015;5:14955.
- Robinson PN, Köhler S, Oellrich A, Wang K, Mungall CJ, Lewis SE, Washington N, Bauer S, Seelow D, Krawitz P. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 2014;24(2):340–8.
- Javed A, Agrawal S, Ng PC. Phen-Gen: combining phenotype and genotype to analyze rare disorders. Nat Methods. 2014;11(9):935–7.
- Li MJ, Yan B, Sham PC, Wang J. Exploring the function of genetic variants in the non-coding genomic regions: approaches for identifying human regulatory variants affecting gene expression. Brief Bioinform. 2014;16(3):393–412. doi:10.1093/bib/bbu018.
- Consortium EP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74.
- Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518(7539):317–30.
- Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310–5.
- Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A. Integrative annotation of variants from 1092 humans: application to cancer genomics. Science. 2013;342(6154):1235587.
- Ritchie GR, Dunham I, Zeggini E, Flicek P. Functional annotation of noncoding sequence variants. Nat Methods. 2014;11(3):294–6.
- Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.
- Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet. 2015;47(8):955–61.
- Wu J, Wu M, Li L, Liu Z, Zeng W, Jiang R. dbWGFP: a database and web server of human whole-genome single nucleotide variants and their functional predictions. Database. 2016;2016. doi:10.1093/database/baw024.
- Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16):e164.
- Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2014;31(5):761–3. doi:10.1093/bioinformatics/btu703.
- Shihab HA, Rogers MF, Gough J, Mort M, Cooper DN, Day IN, Gaunt TR, Campbell C. An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics. 2015;31(10):1536–43.
- Ionita-Laza I, McCallum K, Xu B, Buxbaum JD. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat Genet. 2016;48(2):214–20. doi:10.1038/ng.3477.
- Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010;6(12):e1001025.
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15(8):1034–50.
- Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X. Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics. 2009;25(12):i54–62.
- Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15(7):901–13.
- Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res. 2011;39(17):e118. doi:10.1093/nar/gkr407.
- Chun S, Fay JC. Identification of deleterious mutations within three human genomes. Genome Res. 2009;19(9):1553–61.
- Jiang R, Yang H, Zhou L, Kuo C-CJ, Sun F, Chen T. Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations. Am J Hum Genet. 2007;81(2):346–60.
- Lehmann K-V, Chen T. Exploring functional variant discovery in non-coding regions with SInBaD. Nucleic Acids Res. 2013;41(1):e7.View ArticlePubMedGoogle Scholar
- Lindberg DA, Humphreys BL, McCray AT. The Unified Medical Language System. Methods Inf Med. 1993;32(4):281–91.PubMedGoogle Scholar
- Lipscomb CE. Medical subject headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265.PubMedPubMed CentralGoogle Scholar
- Köhler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GC, Brown DL, Brudno M, Campbell J. The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42(D1):D966–74.View ArticlePubMedGoogle Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004;101(16):6062–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Jiang R, Gan M, He P. Constructing a gene semantic similarity network for the inference of disease genes. BMC Syst Biol. 2011;5 Suppl 2:S2.View ArticlePubMedPubMed CentralGoogle Scholar
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Betel D, Wilson M, Gabow A, Marks DS, Sander C. The microRNA. org resource: targets and expression. Nucleic Acids Res. 2008;36 suppl 1:D149–53.PubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths‐Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL. The Pfam protein families database. Nucleic Acids Res. 2004;32 suppl 1:D138–41.View ArticlePubMedPubMed CentralGoogle Scholar
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2004;32 suppl 1:D115–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Cui Q, Ma Y, Jaramillo M, Bari H, Awan A, Yang S, Zhang S, Liu L, Lu M, O'Connor‐McCourt M. A map of human cancer signaling. Mol Syst Biol. 2007;3(1):152.PubMedPubMed CentralGoogle Scholar
- Matys V, Fricke E, Geffers R, Gößling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV. TRANSFAC®: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003;31(1):374–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Yang JJ. Distribution of Fisher’s combination statistic when the tests are dependent. J Stat Comput Simul. 2010;80(1):1–12.View ArticleGoogle Scholar
- Storey JD. The positive false discovery rate: a Bayesian interpretation and the q-value. Ann Stat. 2003;31(6):2013–35.View ArticleGoogle Scholar
- Jiang R, Wu M, Li L. Pinpointing disease genes through phenomic and genomic data fusion. BMC Genomics. 2015;16 Suppl 2:S3.View ArticleGoogle Scholar
- Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent L-C, De Moor B, Marynen P, Hassan B. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24(5):537–44.View ArticlePubMedGoogle Scholar
- Quang D, Chen Y, Xie X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics. 2015;31(5):761–3.View ArticlePubMedGoogle Scholar
- Consortium EK. De novo mutations in epileptic encephalopathies. Nature. 2013;501(7466):217–21.View ArticleGoogle Scholar
- Zhu B, Chen C, Xue G, Lei X, Li J, Moyzis RK, Dong Q, Lin C. The GABRB1 gene is associated with thalamus volume and modulates the association between thalamus volume and intelligence. Neuroimage. 2014;102:756–63.View ArticlePubMedGoogle Scholar
- Nakamura K, Kodera H, Akita T, Shiina M, Kato M, Hoshino H, Terashima H, Osaka H, Nakamura S, Tohyama J. De Novo mutations in GNAO1, encoding a Gα o subunit of heterotrimeric G proteins, cause epileptic encephalopathy. Am J Hum Genet. 2013;93(3):496–505.View ArticlePubMedPubMed CentralGoogle Scholar
- Saitsu H, Fukai R, Ben-Zeev B, Sakai Y, Mimaki M, Okamoto N, Suzuki Y, Monden Y, Saito H, Tziperman B. Phenotypic spectrum of GNAO1 variants: epileptic encephalopathy to involuntary movements with severe developmental delay. Eur J Hum Genet. 2015;24(1):129–34.View ArticlePubMedPubMed CentralGoogle Scholar
- Neale BM, Kou Y, Liu L, Ma’Ayan A, Samocha KE, Sabo A, Lin C-F, Stevens C, Wang L-S, Makarov V. Patterns and rates of exonic de novo mutations in autism spectrum disorders. Nature. 2012;485(7397):242–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Altshuler D, Daly M, Kruglyak L. Guilt by association. Nat Genet. 2000;26(2):135–8.View ArticlePubMedGoogle Scholar
- Wu X, Jiang R, Zhang MQ, Li S. Network-based global inference of human disease genes. Mol Syst Biol. 2008;4(1):189.PubMedPubMed CentralGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol. 1995;57(1):289–300.Google Scholar