- Open Access
GentrepidV2.0: a web server for candidate disease gene prediction
BMC Bioinformatics volume 14, Article number: 249 (2013)
Candidate disease gene prediction is a rapidly developing area of bioinformatics research with the potential to deliver great benefits to human health. As experimental studies detecting associations between genetic intervals and disease proliferate, better bioinformatic techniques that can expand and exploit the data are required.
Gentrepid is a web resource which predicts and prioritizes candidate disease genes for both Mendelian and complex diseases. The system can take input from linkage analysis of single genetic intervals or multiple marker loci from genome-wide association studies. The underlying database of the Gentrepid tool sources data from numerous gene and protein resources, taking advantage of the wealth of biological information available. Using known disease gene information from OMIM, the system predicts and prioritizes disease gene candidates that participate in the same protein pathways or share similar protein domains. Alternatively, using an ab initio approach, the system can detect enrichment of these protein annotations without prior knowledge of the phenotype.
The system aims to integrate the wealth of protein information currently available with known and novel phenotype/genotype information to acquire knowledge of biological mechanisms underpinning disease. We have updated the system to facilitate analysis of GWAS data and the study of complex diseases. Application of the system to GWAS data on hypertension using the ICBP data is provided as an example. An interesting prediction is a ZIP transporter additional to the one found by the ICBP analysis. The webserver URL is https://www.gentrepid.org/.
The identification of genes implicated in human disease enables an understanding of disease mechanisms and is essential for the development of diagnostics and therapeutics. While genetic approaches such as linkage mapping or genome-wide association studies (GWAS) can successfully identify genomic regions linked to a particular disease, identification of the disease-causing genes is still a difficult task. For linkage analysis, genomic regions are large, as these analyses look at inheritance over several generations in a known pedigree which may have few recombination events. Genetic association studies, which do not focus on recent observable ancestry but rather look at historic recombination, have shorter disease-associated regions . However, when applied at the genome-wide scale (eg. GWAS), the large number of markers used requires stringent correction for multiple testing, limiting the number of reliably identified candidate genes. Lower scoring markers may still indicate potential disease genes but may also be false positives. Analysis of this valuable but noisy data would benefit from a candidate disease gene prioritization approach .
A variety of publicly available systems employ various methodologies to map phenotype to genotype in order to predict or rank candidate disease genes [3, 4]. Many of these systems are available as web services. Because oligogenic diseases are associated with disruption of genes that have similar functions , the most prevalent method employed is gene clustering . Genes are clustered based on shared features, such as common domains, similar functional annotation, involvement in the same protein complex or signaling pathways, co-expression, or combinations of these [6-11].
Here we present the Gentrepid web server, a public candidate disease gene prediction system that associates genes with specified phenotypes using genetic and biomolecular data (Figure 1). Gentrepid draws on two gene clustering methods to make candidate gene predictions; the Common Pathway Scanning (CPS) and Common Module Profiling (CMP) approaches . Both methods identify links between genes in loci associated with a disease phenotype. CPS is based primarily on protein interaction data, whereas CMP is based on sequence data. Gentrepid can be assisted by phenotype-associated genes as seeds (seeded mode), or can work in the absence of disease gene knowledge using only phenotype-associated loci (ab initio mode). The system can be applied to both Mendelian and complex diseases [4, 6, 10].
Construction and content
Underlying databases and data sources
The relational database underlying Gentrepid was created using the PostgreSQL database management system with current biological data on human proteins and genes, including pathways, interaction data and domains. The core of the database is built on gene annotation and sequence data of the NCBI reference assembly (build 36) from RefSeq , with corresponding transcript (mRNA) and protein information. Protein-protein interaction (PPI) data are gathered from the Interlogous Interaction Database (I2D, formerly known as the Online Predicted Human Interaction Database, or OPHID ). I2D contains literature-derived data from multiple databases including BIND , MINT  and HPRD . Pathway data are retrieved from BioCarta  and KEGG , which contain information on signalling and metabolic pathways respectively. Single nucleotide polymorphisms (SNPs), sequence tagged sites (STSs) and cytogenic data which are included as marker positions for input of genetic loci, are pooled from dbSNP (build 131 at the time of the analyses) , dbSTS and Map Viewer from NCBI . Human disease gene information is extracted from the Online Mendelian Inheritance in Man (OMIM)  Morbid Map flat file.
Data generation and processing
For all proteins, several sequence features are pre-calculated including signal peptide cleavage predictions by SignalP , coiled-coil domain predictions by Marcoil  and Multicoil , and predictions of transmembrane helices by TMHMM . These data are used to provide more information about phenotype-related candidate proteins identified in the analyses. The OMIM  morbid map file is remediated to standardize phenotype names. In addition, simple phenotype clustering is implemented using similar text/character strings. For example, the multiple entries for Alzheimer’s disease subtypes are merged into one phenotype based on the text string “Alzheimer”. The purpose of this simple phenotype clustering is to capture as much relevant genetic information on the phenotype as possible . Users have the ability to deconstruct phenotype clusters during the prediction process if desired. For the CMP algorithm, domain annotation of the proteome is performed by parsing all protein sequences against the Pfam library of Hidden Markov models  using HMMer . Pairwise similarity scores between common domains of proteins are calculated using the Smith-Waterman algorithm implemented in SSEARCH .
Quality control and updates
Data is sourced from publicly available databases that are constantly updated. In order to maintain an up-to-date version of the system, data is updated every 6 months, or when a major release of other data sources is made available. Once the data is updated, the system is benchmarked to check for consistency by running validation tests .
Prior disease knowledge and the ab initioapproach
Gentrepid functions under two input modes, seeded and ab initio. Seeded mode is assisted by phenotype-associated genes from OMIM  as seeds. These seed genes help define the phenotype-specific data that Gentrepid bases its predictions upon. This approach was shown to work successfully . Gentrepid additionally allows candidate disease genes to be prioritized across multiple disease-associated loci in the absence of known disease genes (ab initio mode). In this mode, common domains and common pathways linking candidate disease genes from different loci are determined, and the degree of overrepresentation of specific domain combinations or pathways across disease-associated loci is used to make predictions. Candidate genes are then ranked according to the probability that these overrepresented domain combinations or pathways would occur randomly. The ab initio approach is especially useful when little is known about the disease phenotype, and can potentially discover novel biological mechanisms underpinning the disease.
Common pathway scanning
The Common Pathway Scanning (CPS) approach is a Systems Biology method that is based on the finding that genes for a specific phenotype are more likely to interact with other phenotype-specific genes. More specifically, these phenotype-specific proteins are involved in the same pathway or complex [5, 9, 29]. Utilising this knowledge, potential disease genes are predicted by searching implicated disease loci for proteins that are part of the same pathway or complex.
The CPS algorithm in seeded mode uses the phenotype-specific disease genes to associate pathways with the phenotype . In seeded mode, genes within the candidate locus are checked for their occurrence in pathways associated with the seed gene. For each disease, pathways are ranked by the significance of the association, as determined by the Fisher’s exact test statistic. The p-value is calculated by partitioning genes in the phenotype-associated loci based on whether they are associated with the pathway in question or not . Candidate genes are filtered and prioritized based on the lowest p-value score of the pathway they share. The default significance threshold is p path < 0.05.
The CPS algorithm in ab initio mode requires multiple associated loci as input. In this mode, the candidate loci are searched for genes participating in common pathways. These common pathways are then ranked according to the disease-relevance of their associated pathways in the same manner as in seeded mode.
Common module profiling
The second approach, Common Module Profiling (CMP), distinguishes Gentrepid from other candidate disease gene prediction approaches, as it is based on the use of sequence-based domain similarity. Predicated on the principle that similar diseases are often caused by families of genes with related functions , it uses domain-based comparative sequence analysis to identify proteins with potential functional similarity. A domain-based comparison of proteins has several advantages over comparison of the full-length amino acid sequence of proteins. Because structure is conserved over sequence, domain-based sequence comparison searches have been shown to be more accurate than full-length sequence searches , improving the sensitivity of the candidate gene search. Dissection of a protein into domains provides a more fine-grained approach to identifying aspects of protein function affected in the disease. Using the Pfam library of Hidden Markov models , domains can be assigned to approximately 69% of human proteins which allows functional inference for around 54% of human genes. While some other web tools reference protein domains [7, 11], only Gentrepid uses domain sequence comparison rather than keyword matching. As a result, relationships between similar but differently labelled domains can also be captured.
The CMP algorithm in seeded mode uses pairwise alignments of Pfam domains common to the candidate gene and the known disease gene. Candidate genes are scored based on their domain similarity to a known disease gene, and prioritized according to this score .
In ab initio mode, the CMP algorithm predicts and prioritises genes in a different manner. In this method, domain combinations are tested for over-representation in the associated loci compared to the genome as a whole through upper and lower significance tests . Results are prioritized based on χ2 min which has been shown to give a better estimate of the significance .
Data input is a relatively straightforward step-wise process. All that is required is at least one phenotype-associated locus and a known disease gene, or multiple phenotype-associated loci. Gentrepid allows disease intervals to be specified by STSs, SNPs or cytogenetic bands in one of two ways: two markers specifying a start and stop site can be entered; or a central marker and an interval width can be used (Figure 1, bottom right panel). Both new diseases and new disease genes not currently included in the underlying Gentrepid database can be added by the user and incorporated in the analysis. New phenotypes may be created by the user if desired, in which case existing OMIM data will not be accessed. New disease genes are entered into the system by selecting a HUGO gene name or Entrez ID. Additionally, any of Gentrepid’s pre-loaded phenotype-associated genes can be deselected for the current project. As some phenotypes described in Gentrepid reference more than one OMIM ID, genes that an individual researcher believes to be anomalous can be removed from the project. Usage of Gentrepid is outlined in a help page reachable from the start page.
GWA input upgrade
The system was originally designed for input from linkage-mapping loci determined for Mendelian diseases. Such investigations generally involve relatively few loci, making it feasible to manually enter each locus separately into the web server. This applies similarly to the investigation of relatively small numbers of loci associated with highly significant SNPs in GWAS of complex diseases. However, candidate disease gene prediction tools such as Gentrepid are particularly useful when applied to larger numbers of loci. For example, when lower significance thresholds are applied to the GWAS SNPs in order to discover candidates from weaker associations . To facilitate this, in version 2.0 of the system, users can enter multiple loci simultaneously by pasting a list into the text box.
Once a phenotype and an interval, or multiple intervals, have been entered, Gentrepid displays the known disease genes and the genes in the selected regions, both graphically and in tabular format (Figure 1, top left panel). The graphical display utilizes Flash GViewer, a freely available flash program developed by the GMOD project (http://gmod.org/wiki/Flash_GViewer). It shows an overview of the genome, highlighting the individual genes and intervals on the chromosomes that have been selected for the project. From this page, users can choose to perform a CPS or CMP candidate disease gene analysis.
For the CPS analysis, the user can select which preloaded genes and intervals from OMIM to include in the analysis, as well as which pathway databases to include. By default all genes, intervals and pathway databases are used. The analysis is performed, and a list of candidate disease gene predictions is displayed in table format, along with the interval, the known disease genes located in the same pathway upon which the prediction was made, and the pathway source database (Figure 1, bottom left panel).
For the CMP analysis, users have the option of selecting genes and intervals to include as for CPS, and may choose the cut-off scores for the domain comparisons. The results page displays a table of predicted candidate disease genes and their locations, along with the disease gene(s) with similar domains.
For both CPS and CMP, further information about genes of interest can be found on the gene information pages, which can be accessed by clicking on the hyperlinked gene names on the results page. Gene pages are compilations of in-house bioinformatic analyses of gene products, with protein isoforms displayed under individual tabs. Protein domains, coils, and transmembrane helices are all displayed as an image, with links to the text-based results of the Pfam , SignalP , Marcoil , Multicoil , and TMHMM  output. The mRNA sequence can also be accessed, and if applicable, current protein isoform sequences. Both are in FASTA format. The gene page links to relevant entries in the Entrez Gene  and GeneCards  webservers, as well as the UCSC Genome Browser .
User account-based project management
Gentrepid can be accessed in two ways: immediately through Quicksearch or via login after registering for an account. Both are accessible via the homepage. With Quicksearches, results are generated and immediately displayed. Quicksearch results can be saved locally by the user, but they are not stored on the webserver. In order to save results on the webserver for future use, a login account is required. Login accounts are available free of charge to users from academic institutions or non-profit research institutes. They allow users to save their input and access their project at a later date or from a different client, enabling user mobility and collaboration. With an account, users can also create, edit or delete multiple projects. User data are kept in individual password-protected PostgreSQL databases and are not accessible by other accounts. Data transfer to and from the website is also encrypted and protected by a security certificate. Projects remain available until explicitly removed by the user, and can be accessed through tabs along the top panel. Data input and retrieval are analogous to the Quicksearch version.
Utility and discussion
The system can be used on both linkage and genome-wide association data. Linkage data from family studies associates much larger regions of the genome with the disease or phenotype. Similar but not identical to an enrichment analysis, Gentrepid’s modules can be employed to sift through these regions and select genes that warrant further study. In the case of GWAS data, the tool can be used to search the genetic data holistically for affected pathways or molecular mechanisms. GWAS use SNPs in linkage disequilibrium (LD) to pinpoint phenotype-associated areas of the genome. When employing Gentrepid, genes within the vicinity of the associated SNP, not necessarily in LD, are analysed. The associated SNPs may be identifying haplotype blocks which contain control regions affecting the distal protein coding gene in cis. Other systems such as GRAIL  allow users a similar flexibility by requesting either an interval, a gene list or a set of SNPs.
The usefulness of each module for candidate gene prediction seems to depend on the heritability of the phenotype. For Mendelian diseases, CPS is the more effective of the two modules. Because of the small number of loci typically involved in Mendelian disease, it works best when there is some prior knowledge of the genes underlying the phenotype. For complex diseases, where it is believed multiple genes are involved, both modules, common pathways and common functional domains, appear to be effective. Provided enough significant loci are supplied, significant predictions can be made for complex diseases either with or without prior knowledge .
Validation studies of the Gentrepid approach have been published previously for both Mendelian diseases  as well as complex diseases . To summarize, for Mendelian diseases benchmark tests on a set of 170 disease genes for 29 diseases showed that the CMP and CPS methods have a combined sensitivity of 0.52 and a specificity of 0.97, and reduce the candidate list by 13-fold . These tests were performed using artificially constructed loci of 50, 100 or 150 genes around the target disease genes, numbers which are typical of linkage-based disease loci.
Complex diseases are more difficult to benchmark due to limited knowledge of the underlying genetics. In a comparison using loci previously determined by linkage analysis against 11 highly significant Type 2 Diabetes GWAS SNPs, Gentrepid had a sensitivity of 0.18 and a specificity of 0.96 while reducing the candidate list by 19-fold . In a second benchmark on seven complex diseases using GWAS SNP data, Gentrepid was capable of extracting known disease genes and predicting plausible novel disease genes in known and novel loci. Depending on the size of the search space used, the system had a sensitivity that ranged between 0.09 and 1, specificity between 0.55 and 1, and enrichment ratios up to 25-fold .
In addition to these benchmark tests, we have used Gentrepid in-house to successfully predict a novel gene for autosomal recessive spondylocostal dysostosis  and analyse mutations in MESP2, LNFG and HES7. An advantage of Gentrepid’s biological clustering approach is that, as it does not use machine learning algorithms, it does not rely on training data sets and thus avoids concomitant problems such as model overfitting .
Case study: hypertension and blood pressure
Hypertension (HTN) is a medical condition where blood pressure in the arteries is significantly elevated resulting in increased risk for cardiovascular disease, kidney disease or stroke. To demonstrate the application of the Gentrepid system in the analysis of GWAS data, we ran our analyses on a set of 29 significantly associated loci from a meta-GWAS performed by the International Consortium for Blood Pressure (ICBP) [38-40] where the phenotypes investigated were hypertension (HTN), systolic blood pressure (SBP), and diastolic blood pressure (DBP). The study reported 45 likely candidate and confirmed disease genes (Additional file 1: Table S1). We collated a set of 23 HTN-implicated genes from OMIM as seeds (Additional file 1: Table S2), using only genes known prior to the publication of GWAS so as not to skew our results. The seed genes are involved in pathways that regulate blood pressure and blood volume such as: calcium signalling, the renin-angiotensin system, and hormone metabolism [41-43]. Using the 29 significant loci reported, we generated a gene search space based on SNP-gene proximity labelled the adjacent approach where we pool genes adjacent to the associated SNP both upstream and downstream on the + and - strands . We also generated a second gene search space based on SNP-gene distance, labelled the bystander approach. We used a 1Mbp window centred on each SNP and pooled genes within that interval. We then ran Gentrepid in seeded mode, with the 23 seed genes; and in ab initio mode, where no additional genotype/phenotype information is used. We report the predictions made by Gentrepid. We compared these predictions to the reported candidates from the study; which themselves, may or may not be the causal gene. Predictions different to the reported genes in the ICBP study  are annotated as alternate predictions.
From the 29 loci implicated by the ICBP study , the adjacent mapping generated a search space of 77 genes, as some loci mapped to fewer than 4 genes. From these 77 genes, Gentrepid returned 19 gene predictions for 15 of the 29 loci (Table 1). Of these 19 predictions, 12 were gene candidates reported by the ICBP study to be the likely disease genes. Gentrepid thus made 7 alternate gene predictions in 7 loci. Of these alternate predictions, many are in generic signalling pathways (eg. MAPK signalling, p = 0.08). Although the mitogen-activated protein kinase (MAPK) cascade is involved in various cellular functions including vascular oxidative stress, it is debated whether this is solely a symptom of hypertension . The study suggests that variant alleles of the MAPK signalling pathway predispose individuals to hypertension.
The 1Mbp bystander approach generated a search space of 386 genes. Gentrepid returned predictions for 25 of the 29 implicated loci, with a total of 108 gene predictions (results not shown). Of the 25 loci with predictions, 15 gene predictions were congruent with those reported by the ICBP study. Alternate predictions of note include the nuclear factor NF-kappa-B (NFKB1), predicted via the “Corticosteroids and cardioprotection” pathway (p = 9.99e-5). Corticosteroids exert a variety of actions by binding to the glucocorticoid receptor and may play a role in increased water excretion in the kidneys to reduce blood volume and atrial pressure . In addition to the ZIP domain ion transporter SLC39A8 reported by the ICBP, Gentrepid CMP predicted the solute carrier SLC39A12 that, like SLC39A8, has a Zinc transporter domain (Zip, Pfam PF02535). The ICBP study postulated that SLC39A8 may play a role in disease through its transport of cadmium [46, 47].
As demonstrated, both SNP/gene mapping approaches are useful to obtain likely candidates. We recommend the adjacent approach as a first step, and then the 1Mbp bystander approach can be used to generate possible candidates further from the associated locus.
Limitations and future directions
Gentrepid, as with all tools in the field, is limited by the current knowledge database. In an effort to compensate for the missing data, Gentrepid uses multiple modules (CPS, CMP) which gather data from multiple sources (protein-protein interactions, pathways, protein domains). Other protein function data that has been employed include tissue-specific expression levels [48-50], post-translational modifications , phylogenetic lineage , and other ontological classifications . With the publication of regulatory annotation data such as those from the ENCODE project , a future direction for the tool would be to use such regulatory information as a likely candidate disease gene prediction module. Genes targeted by common transcription factors, or miRNAs, amongst other elements that affect gene expression, have also been disease associated and would be a useful update to the system [55, 56].
The Gentrepid web server facilitates the prediction and prioritization of candidate disease genes for both Mendelian and complex diseases using two complementary approaches, namely Common Pathway Scanning and Common Module Profiling. The ability to apply different approaches separately enables the application of different prioritization strategies to different categories of disease—for instance, the optimal approach for the prioritization of candidate disease genes for Mendelian diseases might differ from that for complex diseases, or for cancer. The alternate predictions made by Gentrepid for hypertension and blood pressure traits are interesting candidates that require further validation. The system has shown to be capable of both replicating known or reported candidates and also making novel plausible predictions, demonstrating the usefulness of Gentrepid in de novo analysis and reanalysis of GWAS data. In the future, the addition and integration of yet more data types will further increase the utility of Gentrepid in candidate disease gene prediction and prioritization, for all types of diseases.
Availability and requirements
Gentrepid is available at https://www.gentrepid.org. It requires no special or additional data sources, other than the input data (genetic loci) from the user.
Common Pathway Scanning
Common Module Profiling
Diastolic blood pressure
Genome-wide association studies
Online Mendelian Inheritance in Man
Systolic blood pressure
Single nucleotide polymorphism
Sequence tag sites.
Cardon LR, Bell JI: Association study designs for complex diseases. Nat Rev Genet. 2001, 2 (2): 91-99.
Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2010, 86 (1): 6-22. 10.1016/j.ajhg.2009.11.017.
Tranchevent LC, Capdevila FB, Nitsch D, De Moor B, De Causmaecker P, Moreau Y: A guide to web tools to prioritize candidate genes. Brief Bioinform. 2011, 12 (1): 22-32. 10.1093/bib/bbq007.
Ballouz S, Liu J, Oti M, Gaeta B, Fatkin D, Bahlo M, Wouters M: Analysis of genome-wide association study data using the protein knowledge base. BMC Genet. 2011, 12 (1): 98-
Badano JL, Katsanis N: Beyond mendel: an evolving view of human genetic disease transmission. Nat Rev Genet. 2002, 3 (10): 779-789.
Teber E, Liu J, Ballouz S, Fatkin D, Wouters M: Comparison of automated candidate gene prediction systems using genes implicated in type 2 diabetes by genome-wide association studies. BMC Bioinforma. 2009, 10 (Suppl 1): S69-10.1186/1471-2105-10-S1-S69.
Adie EA, Adams RR, Evans KL, Porteous DJ, Pickard BS: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics. 2006, 22 (6): 773-774. 10.1093/bioinformatics/btk031.
Alarcón-Riquelme M: Role of RUNX in autoimmune diseases linking rheumatoid arthritis, psoriasis and lupus. Arthritis Res Ther. 2004, 6: 169-173. 10.1186/ar1203.
Franke L, Van Bakel H, Fokkens L, De Jong ED, Egmont-Petersen M, Wijmenga C: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 2006, 78 (6): 1011-1025. 10.1086/504300.
George RA, Liu JY, Feng LL, Bryson-Richardson RJ, Fatkin D, Wouters MA: Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucleic Acids Res. 2006, 34 (19): e130-10.1093/nar/gkl707.
Tranchevent LC, Barriot R, Yu S, Van Vooren S, Van Loo P, Coessens B, De Moor B, Aerts S, Moreau Y: ENDEAVOUR update: a web resource for gene prioritization in multiple species. Nucleic Acids Res. 2008, 36 (suppl_2): W377-W384.
Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007, 35 (suppl_1): D61-D65.
Brown KR, Jurisica I: Online predicted human interaction database. Bioinformatics. 2005, 21 (9): 2076-2082. 10.1093/bioinformatics/bti273.
Bader GD, Betel D, Hogue CWV: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003, 31 (1): 248-250. 10.1093/nar/gkg056.
Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Lett. 2002, 513 (1): 135-140. 10.1016/S0014-5793(01)03293-8.
Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TKB, Gronborg M, et al: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res. 2003, 13 (10): 2363-2371. 10.1101/gr.1680803.
Nishimura D: BioCarta. Biotech Software & Internet Report. 2001, 2 (3): 117-120. 10.1089/152791601750294344.
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32: D277-D280. 10.1093/nar/gkh063.
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001, 29 (1): 308-311. 10.1093/nar/29.1.308.
Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 2002, 30 (1): 52-55. 10.1093/nar/30.1.52.
Emanuelsson O, Brunak S, Von Heijne G, Nielsen H: Locating proteins in the cell using TargetP, SignalP and related tools. Nat Protoc. 2007, 2 (4): 953-971. 10.1038/nprot.2007.131.
Delorenzi M, Speed T: An HMM model for coiled-coil domains and a comparison with PSSM-based predictions. Bioinformatics. 2002, 18 (4): 617-625. 10.1093/bioinformatics/18.4.617.
Wolf E, Kim PS, Berger B: MultiCoil: a program for predicting two-and three-stranded coiled coils. Protein Sci. 1997, 6 (6): 1179-1189. 10.1002/pro.5560060606.
Krogh A, Larsson B, Von Heijne G, Sonnhammer ELL: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol. 2001, 305 (3): 567-580. 10.1006/jmbi.2000.4315.
Freudenberg J, Propping P: A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics. 2002, 18: S110-S115. 10.1093/bioinformatics/18.suppl_2.S110.
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer ELL: The Pfam protein families database. Nucleic Acids Res. 2002, 30 (1): 276-280. 10.1093/nar/30.1.276.
Eddy S: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-763. 10.1093/bioinformatics/14.9.755.
Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988, 85 (8): 2444-2448. 10.1073/pnas.85.8.2444.
Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using protein-protein interactions. J Med Genet. 2006, 43 (8): 691-698. 10.1136/jmg.2006.041376.
Jimenez-Sanchez G, Childs B, Valle D: Human disease genes. Nature. 2001, 409 (6822): 853-855. 10.1038/35057050.
George RA, Heringa J: Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins. 2002, 48 (4): 672-681. 10.1002/prot.10175.
Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D: GeneCards: a novel functional genomics compendium with automated data mining and query reformulation support. Bioinformatics. 1998, 14 (8): 656-664. 10.1093/bioinformatics/14.8.656.
Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The human genome browser at UCSC. Genome Res. 2002, 12 (6): 996-1006.
Raychaudhuri S, Plenge RM, Rossin EJ, Ng ACY, Purcell SM, Sklar P, Scolnick EM, Xavier RJ, Altshuler D, Daly MJ, et al: Identifying relationships among genomic disease regions: predicting genes at pathogenic SNP associations and rare deletions. PLoS Genet. 2009, 5 (6): e1000534-10.1371/journal.pgen.1000534.
Sparrow DB, Guillén-Navarro E, Fatkin D, Dunwoodie SL: Mutation of hairy-and-enhancer-of-split-7 in humans causes spondylocostal dysostosis. Hum Mol Genet. 2008, 17 (23): 3761-3766. 10.1093/hmg/ddn272.
Sparrow DB, Sillence D, Wouters MA, Turnpenny PD, Dunwoodie SL: Two novel missense mutations in hairy-and-enhancer-of-split-7 in a family with spondylocostal dysostosis. Eur J Hum Genet. 2010, 18 (6): 674-679. 10.1038/ejhg.2009.241.
Dietterich T: Overfitting and undercomputing in machine learning. ACM computing surveys (CSUR). 1995, 27 (3): 326-327. 10.1145/212094.212114.
Ehret GB, Munroe PB, Rice KM, Bochud M, Johnson AD, Chasman DI, Smith AV, Tobin MD, Verwoert GC, Hwang S-J: Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature. 2011, 478 (7367): 103-109. 10.1038/nature10405.
Levy D, Ehret GB, Rice K, Verwoert GC, Launer LJ, Dehghan A, Glazer NL, Morrison AC, Johnson AD, Aspelund T, et al: Genome-wide association study of blood pressure and hypertension. Nat Genet. 2009, 41 (6): 677-687. 10.1038/ng.384.
Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, Najjar SS, Zhao JH, Heath SC, Eyheramendy S: Genome-wide association study identifies eight loci associated with blood pressure. Nat Genet. 2009, 41 (6): 666-676. 10.1038/ng.361.
Torielli L, Tivodar S, Montella RC, Iacone R, Padoani G, Tarsini P, Russo O, Sarnataro D, Strazzullo P, Ferrari P, et al: α-Adducin mutations increase Na/K pump activity in renal cells by affecting constitutive endocytosis: implications for tubular Na reabsorption. Am J Physiol Renal Physiol. 2008, 295 (2): F478-F487. 10.1152/ajprenal.90226.2008.
Fyhrquist F, Saijonmaa O: Renin-angiotensin system revisited. J Intern Med. 2008, 264 (3): 224-236. 10.1111/j.1365-2796.2008.01981.x.
Carey RM, Siragy HM: Newly recognized components of the renin-angiotensin system: potential roles in cardiovascular and renal regulation. Endocr Rev. 2003, 24 (3): 261-271.
Grossman E: Does increased oxidative stress cause hypertension?. Diabetes Care. 2008, 31 (Supplement 2): S185-S189. 10.2337/dc08-s246.
Guijarro C, Egido J: Transcription factor-kappa B (NF-kappa B) and renal disease. Kidney Int. 2001, 59 (2): 415-424. 10.1046/j.1523-1755.2001.059002415.x.
He L, Wang B, Hay EB, Nebert DW: Discovery of ZIP transporters that participate in cadmium damage to testis and kidney. Toxicol Appl Pharmacol. 2009, 238 (3): 250-257. 10.1016/j.taap.2009.02.017.
Liu Z, Li H, Soleimani M, Girijashanker K, Reed JM, He L, Dalton TP, Nebert DW: Cd 2+ versus Zn 2+ uptake by the ZIP8-dependent symporter: kinetics, electrogenicity and trafficking. Biochem Biophys Res Commun. 2008, 365 (4): 814-820. 10.1016/j.bbrc.2007.11.067.
Liu X, Yu X, Zack D, Zhu H, Qian J: TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinforma. 2008, 9 (1): 271-10.1186/1471-2105-9-271.
Edgar R, Domrachev M, Lash AE: Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30 (1): 207-210. 10.1093/nar/30.1.207.
Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M: ArrayExpress-a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007, 35 (suppl 1): D747-D750.
Bairoch A, Apweiler R: The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28 (1): 45-48. 10.1093/nar/28.1.45.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN: The COG database: an updated version includes eukaryotes. BMC Bioinforma. 2003, 4 (1): 41-10.1186/1471-2105-4-41.
Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 2004, 32 (suppl_1): D262-D266.
Consortium EP: A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol. 2011, 9 (4): e1001046-10.1371/journal.pbio.1001046.
Meola N, Gennarino V, Banfi S: MicroRNAs and genetic diseases. PathoGenetics. 2009, 2 (1): 7-10.1186/1755-8417-2-7.
Kleinjan DJ, Coutinho P: Cis-ruption mechanisms: disruption of cis-regulatory control as a cause of human genetic disease. Brief Funct Genomic Proteomic. 2009, 8 (4): 317-332. 10.1093/bfgp/elp022.
The authors would like to thank industrial training students who contributed to the project including Rosemary Sadsad (database), Ignatius Pang (annotations) and Boer Xu (Gviewer implementation).
The authors declare that they have no competing interest.
SB updated and validated V2.0. JYL developed the database, made the webpages and wrote the database queries. NB supervised construction of the database. AL set-up the webserver. RAG and MAW supervised construction of the website. SB, MO and MAW wrote the manuscript. MO worked on the CMP predictions. BG, DF and MB participated in the design of the study. MAW conceived the study, participated in its design and reviewed the results from the data analysis. All authors read and approved the final manuscript.
About this article
Cite this article
Ballouz, S., Liu, J.Y., George, R.A. et al. GentrepidV2.0: a web server for candidate disease gene prediction. BMC Bioinformatics 14, 249 (2013). https://doi.org/10.1186/1471-2105-14-249
- Candidate disease gene prediction
- Candidate disease genes
- Mendelian diseases
- Complex diseases
- Genome-wide association studies
- Candidate gene identification
- Genetic-association studies