Prediction of gene-phenotype associations in humans, mice, and plants using phenologs
- John O Woods†1,
- Ulf Martin Singh-Blom†1, 2, 3,
- Jon M Laurent1,
- Kriston L McGary1, 4 and
- Edward M Marcotte1Email author
© Woods et al.; licensee BioMed Central Ltd. 2013
Received: 14 November 2012
Accepted: 24 May 2013
Published: 21 June 2013
Phenotypes and diseases may be related to seemingly dissimilar phenotypes in other species by means of the orthology of underlying genes. Such “orthologous phenotypes,” or “phenologs,” are examples of deep homology, and may be used to predict additional candidate disease genes.
In this work, we develop an unsupervised algorithm for ranking phenolog-based candidate disease genes through the integration of predictions from the k nearest neighbor phenologs, comparing classifiers and weighting functions by cross-validation. We also improve upon the original method by extending the theory to paralogous phenotypes. Our algorithm makes use of additional phenotype data — from chicken, zebrafish, and E. coli, as well as new datasets for C. elegans — establishing that several types of annotations may be treated as phenotypes. We demonstrate the use of our algorithm to predict novel candidate genes for human atrial fibrillation (such as HRH2, ATP4A, ATP4B, and HOPX) and epilepsy (e.g., PAX6 and NKX2-1). We suggest gene candidates for pharmacologically-induced seizures in mouse, solely based on orthologous phenotypes from E. coli. We also explore the prediction of plant gene-phenotype associations, as for the Arabidopsis response to vernalization phenotype.
We are able to rank gene predictions for a significant portion of the diseases in the Online Mendelian Inheritance in Man database. Additionally, our method suggests candidate genes for mammalian seizures based only on bacterial phenotypes and gene orthology. We demonstrate that phenotype information may come from diverse sources, including drug sensitivities, gene ontology biological processes, and in situ hybridization annotations. Finally, we offer testable candidates for a variety of human diseases, plant traits, and other classes of phenotypes across a wide array of species.
Computational prediction of complex phenotypes from underlying genes has largely involved increasingly complex in silico simulations of cells and cellular processes. Last year, for example, Karr et al. published a whole-cell computational model made up of twenty-eight submodels, each a simulation of a specific cellular process . Most methods are variations on flux-balance analysis for predicting metabolic phenotypes , in most cases including transcriptional regulatory information [3-6], and yield primarily quantitative data.
In contrast, a number of qualitative methods make use of guilt-by-association in functional networks to predict gene-phenotype associations (reviewed in [7, 8]). Like the quantitative methods, these network-based techniques are species-specific, though they may incorporate data from additional species. While quantitative methods are limited to unicellular organisms, or at least to unicellular phenotypes of multi-cellular organisms, the qualitative methods can provide insight into whole-organism traits.
Phenologs are a natural extension of the concept of deep homology: as a bird’s wing and a human hand arose from a common ancestor structure with a common complement of genes and a similar developmental program , so also might less obviously related phenotypes derive from a common ancestor phenotype affiliated with an underlying conserved gene module. To take the above example of Waardenburg syndrome, certain mammalian neural crest defects and plant gravitropism defects share and partly arise from an ancient, highly conserved vesicle trafficking system.
We set out to improve upon the original phenolog algorithm, which relies on identifying pairs of matching phenotypes across species, with a goal of ranking candidate genes relevant to specific traits and diseases by way of an unsupervised search for similar phenotypes (Figure 1). We reasoned that gene-phenotype association predictions coming from multiple “nearby” (or high similarity) phenologs, preferably across multiple species, should provide more predictive power than those from single phenologs. Our method ranks candidate genes based on both the number and similarity of cognate phenotypes which involve those genes, which might be used as a prioritization for wet lab experiments (Figure 1C).
Additionally, we expanded upon the original phenolog study — which included gene-phenotype data from human, mouse, worm (C. elegans), baker’s yeast, and Arabidopsis thaliana — by adding data from chicken, zebrafish, and even E. coli, as well as additional human and worm datasets. We show that phenotype data may come from a variety of sources, including GO biological processes and gene tissue expression annotations, and that the integration of signal from multiple phenologs markedly improves the predictive power of the method.
A key advantage to a neighborhood-based approach for predicting gene-phenotype associations is the ease with which non-obvious — and thus interesting — biological stories may be teased out. We demonstrate the process with epilepsy, a human syndrome; mouse susceptibility to pharmacologically-induced seizures, a related phenotype, using only E. coli data; and atrial fibrillation, the leading cause of arrhythmia in humans.
In addition to offering concrete predictions, we compared two classifiers for integrating phenologs (additive and naïve Bayes), across a variety of similarity or distance functions, and with different numbers of neighboring phenotypes (k). We also experimented with changing the weighting function used to assign prediction scores, and we tested two frameworks for translating gene-phenotype associations between species, evaluating all of these methods within a consistent cross-validation scheme.
Results and discussion
A matrix-based formalism for comparing gene-phenotype associations between species
The phenolog approach, developed by McGary et al. in , identifies pairs of homologous phenotypes in different organisms by counting the overlap between the sets of genes associated with them. McGary et al. hypothesize that pairs of phenotypes with a greater than expected number of shared genes derive from a shared evolutionary past, and further hypothesize that genes associated with one might therefore be good candidates for the other. To extend this conceptual framework to make predictions based on multiple phenotypes from multiple species, we developed a matrix-based formalism for integrating phenotypic information.
For a given species, the set of all gene-phenotype associations can be thought of as a matrix where rows correspond to genes and columns to phenotypes, and the matrix has a 1 in position i,j if the i-th gene has been observed to be associated with the j-th phenotype. This formalism works well when all the genes and phenotypes studied are from the same organism, but leads to some complications when extended to pairs or groups of species.
In particular, since generating these gene-phenotype matrices involves translation via gene orthology, we investigated whether expansions and contractions of gene families (e.g., in Arabidopsis, which frequently has large paralogous gene expansions relative to other eukaryotes) might produce enough noise to obscure signal from other gene-phenotype associations of interest.
We used the INPARANOID algorithm  to determine which genes in different organisms are orthologs of each other. The INPARANOID algorithm discovers orthology relationships in the form of orthogroups (Figure 2A).
For the method described above, we simply translated other species’ gene-phenotype associations into the target (e.g., into human genes when predicting human gene-disease associations) gene-phenotype matrix by orthogroup, and compared the phenotype columns in terms of human genes (as in Figure 2B-C).
This gene-based approach works very well for closely-related species, where genes often have one-to-one equivalents between species. However, when large orthogroups are involved, the predictive performance of this approach deteriorates.
Notably, the use of orthogroups can dramatically simplify the relationships. Consider orthogroup O A (Figure 2A): one gene from each species is involved in the phenolog; whereas in Figure 2B, a matrix with each row representing a human gene, a single mouse gene-phenotype association translates into three human gene associations because of the paralogous expansion of this gene family in humans. Similarly, in Figure 2C, in which each row is a mouse gene, a single human gene-phenotype association translates into two mouse gene associations, due to a mouse-specific paralogous expansion. In contrast, the orthogroup-based matrix in Figure 2D permits a symmetric comparison of the phenotypes, reducing paralogs in each species to a single orthogroup. Furthermore, a hypergeometric CDF test of the intersection between phenotypes ϕ h and ϕ m will produce different values for the matrices described in Figures 2B-C. The consequence of asymmetric distances is that ϕ h may have ϕ m as its closest neighbor when the search is performed in one direction, but ϕ m may not have ϕ h as its closest neighbor in the reverse search.
The orthogroup-based matrix (Figure 2D) has the advantage of producing consistent, symmetric similarity scores regardless of the direction of prediction; furthermore, these scores are not inflated by the co-occurrence of multiple phenotype observations in a single orthogroup. Unless otherwise noted, we use this framework for the analyses that follow.
Integrating information from multiple phenologs
Given this basic formalism — a matrix of gene-disease associations incorporating phenotypic data from multiple species — we next evaluated methods for ranking genes on the basis of their tendency to be involved in a phenotype of interest. In other words, we wanted to construct a set of predictions X for gene-phenotype associations such that X i j is higher for pairs where the gene i is actually associated with the phenotype j.
One way to incorporate information from multiple phenotypes is by measuring the similarity — in terms of associated genes — between pairs of phenotypes, and integrating the information from different phenotypes in such a way such that more similar phenotypes get more weight than less similar phenotypes. We tested two different ways of integrating this information — one multiplicative naïve Bayes scheme, and one additive method.
We tested a wide variety of measures for the weighting function w j l that calculates a similarity or distance between two sets. Pearson sample correlation is a particularly popular option for expert recommendation systems, such as those used in online retail for generating recommendations from past purchase history. McGary et al. used the hypergeometric CDF, which gives the probability of seeing an intersection of size v or greater between phenotypes containing m and n genetic elements, with N total elements in the species pair (Figure 1A-B).
For f i j l we used v/n, the fraction of the number of genes common to both phenotypes j and l over the number of genes known to be involved in phenotype j, which empirically appears to be a good approximation of the probability that a candidate gene from a single phenolog will turn out to be a true positive .
where Φ is a phenotype matrix and w is a weight matrix of phenotype-phenotype similarity scores.
We also repeated the analysis while varying the distance function (used for searching) and holding the recommendation function (w) the same, and vice-versa (Figure 3B). Pearson sample correlation showed the best performance among the distance functions; however, we found that the hypergeometric CDF was the best weighting function for assigning prediction scores to genes.
Some phenotypes were intrinsically unpredictable; notably, several of these were revealed to be combinations of unrelated diseases that were overcollapsed into the same entity in the initial version of our OMIM database (two such examples were achromatopsia with achondroplasia, and the combination of all blood type genes), thus serving inadvertantly as negative controls.
The best similarity functions produced highly correlated predictions. Further, the best predictions of the worst classifiers were highly correlated with the best predictions of the top-performing classifiers. We thus concluded that the potential benefits of a fusion or blending classifier, a model that draws the best characteristics from simple classifiers via optimization, would be modest at best. At worst, any improvement would be difficult to measure; optimization of such a blending model would require an additional layer of cross-validation, and many phenotypes would need to be dropped due to relatively small associated gene sets.
Importantly, we see a strong improvement in predictive performance on actual gene-phenotype associations as compared to randomized data. The method is able to recover all known genes for several real diseases — but is unable to recover withheld genes for any of the randomized diseases.
In addition to evaluating our method’s overall performance, we wished to take a closer look at its prediction of individual diseases in our database. We chose epilepsy because, despite offering ostensibly correct predictions, it actually scores somewhat poorly in cross-validation. In our initial three-fold leave-one-out test, only one of the three separately withheld genes was recovered at a reasonably testable rank (twelve, in this case).
Top epilepsy predictions include PAX6, PRRX1, and RAX2 (of which PAX6 has been associated with seizures); and PAX3, PAX7, HESX1, and NKX2-1, NKX2-4, NKX2-6, and NKX2-8 (Figure 9). Notably, NKX2-1 is involved in mouse epilepsy , and PAX3 appears in a region linked to the human version of the disease ); neither of these genes were in our database.
Interestingly, these predictions come from the Arabidopsis phenotypes regulation of gene expression by genetic imprinting, cotyledon development, epidermal cell differentiation, and gene silencing by RNA, as well as the yeast phenotype annotation for sensitivity to trichlormethine (nitrogen mustard, or tris(2-chloroethyl)amine).
To learn more about the general predictability of the epilepsy phenotype, we ran an expanded cross-validation, withholding each of the full set of 51 epilepsy genes in our database, and found that six genes could be predicted back — all within the top 120 ranks. We note that even when a phenotype performs poorly in cross-validation, it seems that our method still provides useful predictions.
We wanted to know the extent to which predictions could be attributed to paralogy (shared orthogroup membership) with genes already associated in our database with epilepsy. GABBR1 and GABBR2 are each singleton orthogroup members, and are thus independently predicted. KCNA1 and KCNA2 emerged as paralogs following the human-worm divergence, but are predicted from non-worm phenotypes — and are therefore also independent predictions.
PAX6’s plant-human paralogs make up the top three rank bins in Figure 9. We suggest that even non-independent predictions are of use, provided they are accompanied by independent predictions — since, as mentioned, PAX3, NKX2-1, and PAX6 are all associated to some degree with seizures and/or epilepsy. Indeed, the inclusion of species in which these genes are not paralogs offers additional resolution on predictions and demonstrates the utility of our method.
Predicting from E. coli— Pharmacologically-induced seizures
Next, we examined the predictions for promising new candidate genes. One of the most intriguing candidates was α-adducin, which is known to be reduced in the brains of rats experiencing kainate-induced seizures .
Another interesting prediction is Sv2a (synaptic vesicle glycoprotein). It was recently reported that a mutation in chicken SV2A leads to photosensitive reflex epilepsy . Mouse Sv2a is a known binding site for levetiracetam, an antiepileptic drug , and Sv2a−/− mice experience seizures and die within three weeks of birth [19, 20].
We also examined the compounds associated with the source E. coli phenotypes to see if these were associated with seizures. The compounds included ethanol — alcohol poisoning and alcohol withdrawal symptoms include seizures — as well as paraquat, which causes seizures and brain damage in rats ; and aztreonam, which is a convulsant . While many compounds might cause seizures if given in sufficient amounts, a control PubMed search for ten randomly chosen compounds associated with E. coli phenotypes in our database failed to turn up such clear associations.
Thus, both at the level of predicting candidate genes and affiliated compounds, the E. coli phenotypes appear to be relevant. Both of the genes discussed above may indeed represent reasonable new candidates for affecting pharmacologically-induced seizures and might warrant follow-up experiments. Finally, it is particularly striking that mammalian seizures — which are distinctly neurological phenomena — could be derived from processes so fundamental as to exist even in bacteria.
We looked next at the human heart phenotype atrial fibrillation (AF), expecting to find that AF, like pharmacologically-induced seizures, was rooted in highly-conserved signaling defects. Instead, we found that the most predictive phenotypes were from mouse and chicken — quite unlike epilepsy, for which plants, worms, and yeast offered a great deal more information than mouse or chicken.
The AF phenotype performed well cross-validation in the gene-based configuration method. However, in the orthogroup-based cross-validation, only three of the eight genes associated could be predicted after being withheld. The removed genes were predicted at ranks 3-4, 15-16, and 81-94. Nonetheless, the novel predictions for this phenotype are worth noting.
Similarly predicted are ATP4A and, further down the list, ATP4B, which are the α and β subunits of the H+/K+ ATPase. This proton pump is responsible for gastric acid secretion during digestion.
A somewhat speculative connection is offered by recent work, which showed cigarette smoke extracts cause an increase in the amount of H+/K+ ATPase in the stomach . It is unclear — and worth testing — whether ATP4A and ATP4B are expressed in the heart. These genes could offer an additional route by which smoking contributes to heart problems.
Following HRH2 and ATP4A is HOPX, or homeodomain only protein x, which is down-regulated during heart failure in humans . It is not clear that HOPX is involved in AF per se, but again worth exploring in future experiments, as is the gene ranked next, KCNE1, based on orthologous phenotype prolonged QT interval — and seemingly also a factor in rare cases of atrial fibrillation [30-32].
GJA1 (gap junction protein, α1, also known as connexin 43 or Cx43) is one of the two most abundantly expressed connexins in the heart [33-35]. The other is GJA5 (connexin 40), already associated in our database with AF. Cx40 and Cx43 seem to form heteromeric channels with different properties from homomeric channels . Cx43, unlike Cx40, is essential for heart development and cardiac impulse conductance in mice . Tuomi et al. observed that a dominant negative Cx43 mutant causes severe AF . Finally, atrial fibrillation was observed in a somatic mutation in human GJA1. Notably, another channel similarly implicated was SCN5A (human cardiac sodium channel, voltage-gated, type V, α subunit). This sodium channel component has been associated with atrial fibrillation [40-42] but was missing from our database.
In terms of the predictor phenotypes themselves, the top AF phenologs can be grouped into three basic categories: cardiac, gastric, and auditory. We have explored the first two categories, but have not considered genes from the third. We note that while Jervell and Lange-Nielsen syndrome (i.e., long QT syndrome) has been associated with deafness for half a century [43-46] via alleles of KCNQ1 and KCNE1, other genes may yet be involved . Further, Belmont et al. write of “a growing appreciation for conditions that affect hearing and which are accompanied by significant cardiovascular disorders” .
Given the success with which our method was able to predict AF genes — and with which it was able to identify potentially related disorders — exploration of additional candidates (e.g., ATP4A/B, POU4F3, and S1PR2) from Figure 11 may be warranted.
Plant phenotypes — response to vernalization
Finally, having focused primarily on predicting mammalian phenotype- and disease-genes, we asked whether plant gene-phenotype associations could be predicted from the other species in our database.
Plants represented a particular challenge, since a number of factors reduce the specificity of predictions for plant phenotypes. Firstly, while human phenotypes are predicted at least in part from other mammals and even other vertebrates — which are phylogenetically similar — there are no close neighbor species to Arabidopsis in our database.
Secondly, while 19,439 of the 28,002 human genes in our database have orthologs in other species, the ratio is less promising for A. thaliana phenolog predictions: 12,668 of 27,325 have orthologs. The cause is likely again the lack of other plants in our database, compared to the several vertebrates from which to draw information for H. sapiens.
Third and finally, the Arabidopsis genome contains a great deal of redundancy, as observed in : 37.4% of proteins belong to families of more than five members, compared to 12.1% in fruit fly and 24.0% in worm. In predictions that rely on gene orthology, as with phenologs, there is often no way to distinguish which of the plant paralogs is most relevant — except perhaps by relying on paralogous phenotypes.
Our orthogroup-based matrix formalism (Figure 2D) was thus critical for addressing the extensive divergence of gene families between distant species. In particular and as previously described, when attempting to predict Arabidopsis phenotypes, we noticed that the gene-based formalism resulted in asymmetric scores and unwarranted improvements in rank of certain predictions, particularly those where large orthogroups were involved (Figure 2A-C). The genes-as-rows configuration also inflated performance, as measured by ROC plots, during cross-validation — primarily due to the high frequency with which plant gene expansions co-participate in a biological process.
We selected this phenotype because it scores better than most other plant phenotypes in cross-validation; seven of the fifteen genes in this plant phenotype can be predicted back at low rank when withheld, representing two or three orthogroups (about half of the total number of orthogroups) depending upon the source species considered.
Although we cannot easily cross-validate predictions from paralogous phenotypes, since they are not sufficiently independent, we speculate that the inclusion of paralogous phenotype data may help to improve the specificity of the predictions at no perceivable cost.
Among the new candidate genes implicated, two are particularly notable. One of these, EMF2 — which appears to be associated with vernalization-mediated flowering by its interaction with CLF — is predicted based on seemingly unrelated orthologous mouse and human phenotypes (abnormal chorion morphology and endometrial cancer, respectively). EMF2 is paralogous with known vernalization gene VRN2; however, it is ranked ahead of VRN2 by its association with the related plant phenotype negative regulation of flower development. That EMF2 was boosted by a potential paralogous phenotype supports the hypothesis that paralogous phenotypes are similarly useful to orthologous phenotypes in predicting gene function.
The second interesting prediction is FWA and its several paralogs (HDG1-4, HDG7-12, PDF2, ANL2, ATML1, AT5G07260, and HB-7). Certain FWA mutants produce a vernalization-insensitivity phenotype [53, 54]. Candidates ANL2 and PDF2 both have late flowering phenotypes [55, 56] markedly similar to that of FWA. That discovery lends additional support for paralogous phenotypes, as neither FWA nor PDF2 were associated in our database with regulation of flower development — but our method successfully identified negative regulation of flower development as a potential phenolog.
Empirically, we believe that the strength of our predictions for plant phenotypes are limited primarily by the quantity of gene-phenotype information available for plants. Notably, the addition of associations from other plant species would prove exceptionally useful for predicting not only Arabidopsis phenotypes, but crop species as well, and — as demonstrated earlier — even animal phenotypes.
We sought to determine whether the improvement in our method over the original Phenologs method could be attributed in part to the additional species information or arose exclusively from incorporating phenotypes beyond the nearest neighbor (as demonstrated in Figure 5).
Next, we tried adding the new species (chicken, E. coli, and zebrafish) individually, and found improvements at relevant ranks for E. coli and D. rerio. Surprisingly, inclusion of chicken data hurt the predictive performance. We also tried adding E. coli and D. rerio at the same time, and saw an additional increase in performance.
Given the decrease in performance resulting from the inclusion of chicken phenotypes, we sought to determine whether the additional C. elegans datasets were negatively affecting performance. In Figure 14B, we tried leaving out the broad and specific components of the green dataset. We found that either component alone performed worse than the combination. We also tried leaving out the green datasets altogether, and found that their inclusion moderately improved performance at relevant ranks, but decreased performance beyond around rank 45.
These mixed results with the datasets are somewhat surprising. We expected that the in situ hybridization expression annotations from GEISHA would be useful for human predictions not only because gene expression stage and location should correlate highly with phenotype, but also because chicken — like mouse — is more closely related to human compared to other species in our database.
The distributions of genes per phenotype for human, mouse, Arabidopsis, and chicken were similar (not pictured). Only E. coli differed substantially, with most phenotypes involving between 800 and 1,000 genes. However, in general, the counts of E. coli-human orthologs involved in bacterial phenotypes are much smaller due to the relatively small fraction of genes with orthologs between the two species.
In summary, we set out to improve upon the results of the original phenolog project by unifying information from a “neighborhood” of phenotypes surrounding the phenotype or disease of interest. Our method produces ranked predictions for a large percentage of human diseases in OMIM, as well as for plant biological process-based phenotypes.
Notably, we were able to demonstrate the correct prediction of at least one gene associated with the mouse phenotype pharmacologically-induced seizures using only phenologs from E. coli. While McGary et al. demonstrated the existence of deep homology between mice and single-celled eukaryotes, our work suggests that examples of deep homology exist — and may even offer useful predictions — between prokaryotes and eukaryotes.
We also demonstrate that the term “phenotype” may be interpreted broadly when incorporating gene-association data for phenolog-based predictions. Gene Ontology biological processes are one potential source. Another potential source is annotations for in situ hybridization experiments, such as GEISHA, but it may be necessary to refine such a phenotype database by hand.
Finally, we give a number of concrete gene predictions for the human diseases atrial fibrillation and epilepsy, and show how phenologs may be used to generate hypotheses and a biological context that correctly connect categories of diseases, such as disorders of the heart, stomach, and sensorineural system.
For the gene-based matrix, we compared classifiers and metrics using n-fold cross-validation, and calculated receiver operating characteristic (ROC) and precision-recall curves for each disease or phenotype to be predicted. Classifiers could be represented by arrays of area-under-the-curve measurements.
With the orthogroup-based matrix we chose a simpler and faster “leave-one-out” cross-validation scheme, where one observed gene association was hidden for each disease. Noting that some orthogroups have multiple genes associated with the same phenotype, we also hid any orthogroups associated with hidden genes. Since a gene may be part of one orthogroup for each species included in the search, we measured the rank of predicted genes rather than predicted orthogroups. When multiple genes were predicted with the same score, the mean rank was used.
The leave-one-out procedure was repeated three times for each phenotype, taking the median hidden gene rank to be representative of the classifier-phenotype performance.
Additional phenotype data
In addition to those databases described in , we incorporated orthology and gene-phenotype data from a variety of additional species. We excluded any phenotypes with fewer than three associated genes.
Our choice of species was driven primarily by availability of data in a useful format, namely that phenotype annotations could be expressed qualitatively, and that we we could link those annotations to a protein sequence; for example, we wished to incorporate phenotype data from the agricultural plant database Gramene, but most or all phenotype-associated genes lacked sequences.
Human diseases came from OMIM as for , but updated on August 17, 2011. Additional C. elegans phenotypes came from Green et al.  and were broken down into two datasets, green-broad and green-specific, according to the classifications given by the authors in the second supplemental table, “Broad Phenotypic Category” and the more specific subcategories into which they were divided. We chose to maintain the dual categorization primarily because a number of the more specific phenotypic classes were monogenic, and hence would have been ignored altogether in our analysis.
E. coli phenotypes were taken on May 20, 2011, from the file ‘coli_FinalData2.txt’ . Each gene’s phenomic profile was sorted by score, assigning both the top and bottom forty conditions to the gene. Thus, each condition was considered to be a phenotype, and the genes associated with that phenotype were those genes whose growth was most affected — either positively or negatively — in the corresponding condition.
We considered fruit fly phenotypes from FlyBase . Unfortunately, FlyBase’s dataset — while extensive — makes use of a controlled vocabulary optimized for manual searching rather than high-throughput analysis. The only way to connect a phenotypic class annotation to an anatomical location or developmental stage is by allele and literature reference — if these are given at all. We attempted to match anatomical annotations for mutant phenotypes to annotations from the phenotypic class ontology, joining on allele and publication. While it was possible to predict some human diseases based on fruit fly phenotypes from FlyBase, the results were noisy and difficult to interpret, and we ultimately chose to exclude fruit fly results.
Zebrafish phenotypes consisted of gene ontology (GO) biological processes from ZFIN , keeping only those annotations with evidence types of IMP, IDA, IPI, IGI, TAS, NAS, IC, and IEP — the same procedure used for Arabidopsis phenotypes, obtained from TAIR . These evidence types were selected so as to avoid the inclusion of annotations that originated directly from knowledge of other model organisms.
For chicken (Gallus gallus) phenotypes, we utilized in situ hybridization annotations from GEISHA , kindly provided in XML format on June 24, 2011. If there were more than fifty genes associated with a specific location and more than three at a specific state at that location, a new phenotype was created (“anatomical location at stage x”); and regardless, each location became an independent phenotype. We defined phenotypes as gene-expression associations in specific anatomical locations. For those locations with more than fifty genes annotated, we created additional phenotypes for each stage with greater than three associated genes.
Availability of supporting data
The data sets supporting the results of this article are available at http://www.phenologs.org/knn/.
This work was supported by grants from the Texas Advanced Research Program, the National Science Foundation, the National Institutes of Health, U.S. Army Research (58343-MA), and the Welch Foundation (F-1515); a Packard Fellowship (to E.M.M.); and a National Science Foundation Graduate Research Fellowship (to J.O.W.).
- Karr JR, Sanghvi JC, Macklin DN, Gutschow MV, et al: A whole-cell computational model predicts phenotype from genotype. Cell. 2012, 150 (2): 389-401. 10.1016/j.cell.2012.05.044.PubMed CentralView ArticlePubMedGoogle Scholar
- Varma A, Palsson BO: Metabolic flux balancing: basic concepts, scientific and practical use. Nat Biotechnol. 1994, 12: 994-998. 10.1038/nbt1094-994.View ArticleGoogle Scholar
- Covert MW, Schilling CH, Palsson B: Regulation of gene expression in flux balance models of metabolism. J Theor Biol. 2001, 213: 73-88. 10.1006/jtbi.2001.2405. [http://www.ncbi.nlm.nih.gov/pubmed/11708855]View ArticlePubMedGoogle Scholar
- Covert M, Knight E, Reed J, Herrgard M, Palsson B: Integrating high-throughput and computational data elucidates bacterial networks. Nature. 2004, 429 (May): 92-96. [http://www.nature.com/nature/journal/v429/n6987/abs/nature02456.html]View ArticlePubMedGoogle Scholar
- Covert MW, Xiao N, Chen TJ, Karr JR: Integrating metabolic, transcriptional regulatory and signal transduction models in Escherichia coli. Bioinformatics (Oxford, England). 2008, 24 (18): 2044-2050. 10.1093/bioinformatics/btn352. [http://www.ncbi.nlm.nih.gov/pubmed/18621757]View ArticleGoogle Scholar
- Chandrasekaran S, Price ND: Probabilistic integrative modeling of genome-scale metabolic and regulatory networks in Escherichia coli and Mycobacterium tuberculosis. Proc Natl Acad Sci U S A. 2010, 107 (41): 17845-17850. 10.1073/pnas.1005139107. [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2955152]PubMed CentralView ArticlePubMedGoogle Scholar
- Wang K, Dickson SP, Stolle Ca, Krantz IDetal: Interpretation of association signals and identification of causal variants from genome-wide association studies. Am J Hum Genet. 2010, 86 (5): 730-742. 10.1016/j.ajhg.2010.04.003.PubMed CentralView ArticlePubMedGoogle Scholar
- Greene CS, Troyanskaya OG: Chapter 2: Data-driven view of disease biology. PLoS Comput Biol. 2012, 8 (12): 002816-[http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3531282&tool=pmcentrez&rendertype=abstract]View ArticleGoogle Scholar
- McGary KL, Park TJ, Woods JO, Cha HJ, et al: Systematic discovery of nonobvious human disease models through orthologous phenotypes. Proc Natl Acad Sci U S A. 2010, 107 (14): 6544-6549. 10.1073/pnas.0910200107.PubMed CentralView ArticlePubMedGoogle Scholar
- Shubin N, Tabin C, Carroll S: Fossils, genes and the evolution of animal limbs. Nature. 1997, 388 (6643): 639-648. 10.1038/41710.View ArticlePubMedGoogle Scholar
- Ostlund G, Schmitt T, Forslund K, Köstler T, et al: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010, 38 (Database issue): D196—D203-PubMed CentralPubMedGoogle Scholar
- Wang X, Sun W, Zhu X, Li L, et al: Association between the gamma-aminobutyric acid type B receptor 1 and 2 gene polymorphisms and mesial temporal lobe epilepsy in a Han Chinese population. Epilepsy Res. 2008, 81 (2-3): 198-203.View ArticlePubMedGoogle Scholar
- Glasscock E, Yoo JW, Chen TT, Klassen TL, Noebels JL: Kv1.1 potassium channel deficiency reveals brain-driven cardiac dysfunction as a candidate mechanism for sudden unexplained death in epilepsy. J Neurosci. 2010, 30 (15): 5167-5175. 10.1523/JNEUROSCI.5591-09.2010.PubMed CentralView ArticlePubMedGoogle Scholar
- Butt SJB, Sousa VH, Fuccillo MV, Hjerling-Leffler J, et al: The requirement of Nkx2-1 in the temporal specification of cortical interneuron subtypes. Neuron. 2008, 59 (5): 722-732. 10.1016/j.neuron.2008.07.031.PubMed CentralView ArticlePubMedGoogle Scholar
- Ratnapriya R, Vijai J, Kadandale JS, et al: A locus for juvenile myoclonic epilepsy maps to 2q33-q36. Hum Genet. 2010, 128 (2): 123-130. 10.1007/s00439-010-0831-6.View ArticlePubMedGoogle Scholar
- Wyneken U, Smalla KH, Marengo JJ, Soto D, et al: Kainate-induced seizures alter protein composition and N-methyl-D-aspartate receptor function of rat forebrain postsynaptic densities. Neuroscience. 2001, 102: 65-74. 10.1016/S0306-4522(00)00469-3.View ArticlePubMedGoogle Scholar
- Douaud M, Feve K, Pituello F, Gourichon D, et al: Epilepsy caused by an abnormal alternative splicing with dosage effect of the SV2A gene in a chicken model. PloS one. 2011, 6 (10): e26932-10.1371/journal.pone.0026932.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaminski RM, Gillard M, Leclercq K, Hanon E, et al: Proepileptic phenotype of SV2A-deficient mice is associated with reduced anticonvulsant efficacy of levetiracetam. Epilepsia. 2009, 50 (7): 1729-1740. 10.1111/j.1528-1167.2009.02089.x.View ArticlePubMedGoogle Scholar
- Janz R, Goda Y, Geppert M, Missler M, Südhof TC: SV2A and SV2B function as redundant Ca2+ regulators in neurotransmitter release. Neuron. 1999, 24 (4): 1003-1016. 10.1016/S0896-6273(00)81046-6.View ArticlePubMedGoogle Scholar
- Crowder KM, Gunther JM, Jones Ta, et al: Abnormal neurotransmission in mice lacking synaptic vesicle protein 2A (SV2A). Proc Natl Acad Sci U S A. 1999, 96 (26): 15268-15273. 10.1073/pnas.96.26.15268.PubMed CentralView ArticlePubMedGoogle Scholar
- Bagetta G, Corasaniti MT, Iannone M, Nisticò G, Stephenson JD: Production of limbic motor seizures and brain damage by systemic and intracerebral injections of paraquat in rats. Pharmacol Toxicol. 1992, 71 (6): 443-448. 10.1111/j.1600-0773.1992.tb00575.x.View ArticlePubMedGoogle Scholar
- De Sarro A, Grasso S, De Sarro GB, Ammendola D: Relationship between structure and convulsant properties of some beta-lactam antibiotics following intracerebroventricular microinjection in rats. Antimicrob Agents Chemother. 1995, 39: 232-237. 10.1128/AAC.39.1.232.PubMed CentralView ArticlePubMedGoogle Scholar
- Barger G, Dale HH: Chemical structure and sympathomimetic action of amines. J Physiol. 1910, 41 (1-2): 19-59.PubMed CentralView ArticlePubMedGoogle Scholar
- Skovgaard N, Møller K, Gesser H, Wang T: Histamine induces postprandial tachycardia through a direct effect on cardiac H2-receptors in pythons. Am J Physiol Regul Integr Comp Physiol. 2009, 296 (3): R774—R785-PubMedGoogle Scholar
- Shigenobu K, Tatsuno H, Matsuki N, Oshima T, Kasuya Y: Electrophysiological and mechanical studies on the cardiac effects of a histamine H2 receptor antagonist, cimetidine, in the isolated guinea pig myocardium. J Pharm Dyn. 1979, 2: 141-150. 10.1248/bpb1978.2.141.View ArticleGoogle Scholar
- Levi R, Malm JR, Bowman FO, Rosen MR: The arrhythmogenic actions of histamine on human atrial fibers. Circ Res. 1981, 49 (2): 545-550. 10.1161/01.RES.49.2.545.View ArticlePubMedGoogle Scholar
- He G, Hu J, Li T, Ma X, et al: The arrhythmogenic effect of sympathetic histamine in mouse hearts subjected to acute ischemia. Mol Med. 2011, 18 (1): 1-9.PubMed CentralView ArticleGoogle Scholar
- Hammadi M, Adi M, John R, Khoder GAK, Karam SM: Dysregulation of gastric H,K-ATPase by cigarette smoke extract. World J Gastroenterol. 2009, 15 (32): 4016-4022. 10.3748/wjg.15.4016.PubMed CentralView ArticlePubMedGoogle Scholar
- Trivedi CM, Cappola TP, Margulies KB, Epstein Ja: Homeodomain only protein x is down-regulated in human heart failure. J Mol Cell Cardiol. 2011, 50 (6): 1056-1058. 10.1016/j.yjmcc.2011.02.015.PubMed CentralView ArticlePubMedGoogle Scholar
- Ellinor PT, Petrov-Kondratov VI, Zakharova E, Nam EG, MacRae Ca: Potassium channel gene mutations rarely cause atrial fibrillation. BMC Med Genet. 2006, 7: 70-10.1186/1471-2350-7-70.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu LX, Yang WY, Zhang HQ, Tao ZH, Duan CC: [Study on the correlation between CETP TaqIB, KCNE1 S38G and eNOS T-786C gene polymorphisms for predisposition and non-valvular atrial fibrillation]. Zhonghua Liu Xing Bing Xue Za Zhi. 2008, 29 (5): 486-492.PubMedGoogle Scholar
- Yao J, Ma Yt, Xie X, Liu F, et al: [Association of rs1805127 polymorphism of KCNE1 gene with atrial fibrillation in Uigur population of Xinjiang]. Zhonghua Yi Xue Yi Chuan Xue Za Zhi. 2011, 28 (4): 436-440.PubMedGoogle Scholar
- Beyer EC, Paul DL, Goodenough Da: Connexin43: a protein from rat heart homologous to a gap junction protein from liver. J Cell Biol. 1987, 105 (6 Pt 1): 2621-2629.View ArticlePubMedGoogle Scholar
- Delorme B, Dahl E, Jarry-Guichard T, Marics I, et al: Developmental regulation of connexin 40 gene expression in mouse heart correlates with the differentiation of the conduction system. Dev Dyn. 1995, 204 (4): 358-371. 10.1002/aja.1002040403.View ArticlePubMedGoogle Scholar
- Lin X, Gemel J, Glass A, Zemlin CW, et al: Connexin40 and connexin43 determine gating properties of atrial gap junction channels. J Mol Cell Cardiol. 2010, 48: 238-245. 10.1016/j.yjmcc.2009.05.014.PubMed CentralView ArticlePubMedGoogle Scholar
- Cottrell GT, Burt JM: Heterotypic gap junction channel formation between heteromeric and homomeric Cx40 and Cx43 connexons. Am J Physiol Cell Physiol. 2001, 281 (5): C1559-C1567.PubMedGoogle Scholar
- Reaume AG, de Sousa PA, Kulkarni S, Langille BL, et al: Cardiac malformation in neonatal mice lacking connexin43. Sci (New York, NY). 1995, 267 (5205): 1831-1834. 10.1126/science.7892609.View ArticleGoogle Scholar
- Tuomi JM, Tyml K, Jones DL: Atrial tachycardia/fibrillation in the connexin 43 G60S mutant (Oculodentodigital dysplasia) mouse. Am J Physiol Heart Circ Physiol. 2011, 300 (4): H1402—H1411-View ArticlePubMedGoogle Scholar
- Thibodeau IL, Xu J, Li Q, Liu G, et al: Paradigm of genetic mosaicism and lone atrial fibrillation: physiological characterization of a connexin 43-deletion mutant identified from atrial tissue. Circulation. 2010, 122 (3): 236-244. 10.1161/CIRCULATIONAHA.110.961227.View ArticlePubMedGoogle Scholar
- Laitinen-Forsblom PJ, Mäkynen P, Mäkynen H, Yli-Mäyry S, et al: SCN5A mutation associated with cardiac conduction defect and atrial arrhythmias. J Cardiovasc Electrophysiol. 2006, 17 (5): 480-485. 10.1111/j.1540-8167.2006.00411.x.View ArticlePubMedGoogle Scholar
- Darbar D, Kannankeril PJ, Donahue BS, Kucera G, et al: Cardiac sodium channel (SCN5A) variants associated with atrial fibrillation. Circulation. 2008, 117 (15): 1927-1935. 10.1161/CIRCULATIONAHA.107.757955.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen L, Zhang W, Fang C, Jiang S, et al: Polymorphism H558R in the human cardiac sodium channel SCN5A gene is associated with atrial fibrillation. J Int Med Res. 2011, 39 (5): 1908-1916. 10.1177/147323001103900535.View ArticlePubMedGoogle Scholar
- Fraser GR, Froggatt P, James TN: Congenital deafness associated with electrocardiographic abnormalities, fainting attacks and sudden death. A recessive syndrome. Q J Med. 1964, 33: 361-385.PubMedGoogle Scholar
- Fraser GR, Froggatt P, Murphy T: Genetical aspects of the cardio-auditory syndrome of Jervell and Lange-Nielsen (Congenital deafness and electrocardiographic abnormalities). Ann Hum Genet. 1964, 28: 133-157. 10.1111/j.1469-1809.1964.tb00469.x.View ArticlePubMedGoogle Scholar
- Schwartz PJ: The long QT syndrome. Curr Probl Cardiol. 1997, 22 (6): 297-351. 10.1016/S0146-2806(97)80009-6.View ArticlePubMedGoogle Scholar
- Schwartz PJ, Spazzolini C, Crotti L, Bathen J, et al: The Jervell and Lange-Nielsen syndrome: natural history, molecular basis, and clinical outcome. Circulation. 2006, 113 (6): 783-790. 10.1161/CIRCULATIONAHA.105.592899.View ArticlePubMedGoogle Scholar
- Neyroud N, Tesson F, Denjoy I, Leibovici M, et al: A novel mutation in the potassium channel gene KVLQT1 causes the Jervell and Lange-Nielsen cardioauditory syndrome. Nat Genet. 1997, 15 (2): 186-189.View ArticlePubMedGoogle Scholar
- Schulze-Bahr E, Wang Q, Wedekind H, Haverkamp W, et al: KCNE1 mutations cause Jervell and Lange-Nielsen syndrome. Nat Genet. 1997, 17 (3): 267-268. 10.1038/ng1197-267.View ArticlePubMedGoogle Scholar
- Gritli S, Ben Salah M, Robson CD, Shili A, et al: Association of the long QT syndrome With goiter and deafness. Am J Cardiol. 2010, 105 (5): 681-686. 10.1016/j.amjcard.2009.10.034.View ArticlePubMedGoogle Scholar
- Belmont JW, Craigen W, Martinez H, Jefferies JL: Genetic disorders with both hearing loss and cardiovascular abnormalities. Adv Otorhinolaryngol. 2011, 70: 66-74.PubMedGoogle Scholar
- The Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408 (6814): 796-815. 10.1038/35048692.View ArticleGoogle Scholar
- Chanvivattana Y, Bishopp A, Schubert D, Stock C, et al: Interaction of Polycomb-group proteins controlling flowering in Arabidopsis. Dev (Cambridge, England). 2004, 131 (21): 5263-5276. 10.1242/dev.01400.View ArticleGoogle Scholar
- Koornneef M, Hanhart CJ, van der Veen JH: A genetic and physiological analysis of late flowering mutants in Arabidopsis thaliana. Mol Gen Genet : MGG. 1991, 229: 57-66.View ArticlePubMedGoogle Scholar
- Genger RK, Peacock WJ, Dennis ES, Finnegan EJ: Opposing effects of reduced DNA methylation on flowering time in Arabidopsis thaliana. Planta. 2003, 216 (3): 461-466.PubMedGoogle Scholar
- Weigel D, Ahn JH, Blázquez MA, Borevitz JO, et al: Activation tagging in Arabidopsis. Plant Physiol. 2000, 122 (4): 1003-1013. 10.1104/pp.122.4.1003.PubMed CentralView ArticlePubMedGoogle Scholar
- Abe M, Katsumata H, Komeda Y, Takahashi T: Regulation of shoot epidermal cell differentiation by a pair of homeodomain proteins in Arabidopsis. Dev (Cambridge, England). 2003, 130 (4): 635-643. 10.1242/dev.00292.View ArticleGoogle Scholar
- Ikeda Y, Kobayashi Y, Yamaguchi A, Abe M, Araki T: Molecular basis of late-flowering phenotype caused by dominant epi-alleles of the FWA locus in Arabidopsis. Plant Cell Physiol. 2007, 48 (2): 205-220.View ArticlePubMedGoogle Scholar
- Green Ra, Kao HL, Audhya A, Arur S, et al: A high-resolution C. elegans essential gene network based on phenotypic profiling of a complex tissue. Cell. 2011, 145 (3): 470-482. 10.1016/j.cell.2011.03.037.PubMed CentralView ArticlePubMedGoogle Scholar
- Nichols RJ, Sen S, Choo YJ, Beltrao P, et al: Phenotypic landscape of a bacterial cell. Cell. 2011, 144: 143-156. 10.1016/j.cell.2010.11.052.PubMed CentralView ArticlePubMedGoogle Scholar
- Tweedie S, Ashburner M, Falls K, et al: FlyBase: enhancing Drosophila Gene Ontology annotations. Nucleic Acids Res. 2009, 37 (Database issue): D555-D559.PubMed CentralView ArticlePubMedGoogle Scholar
- Sprague J, Bayraktaroglu L, Clements D, Conlin T, et al: The Zebrafish Information Network: the zebrafish model organism database. Nucleic Acids Res. 2006, 34 (Database issue): D581-D585.PubMed CentralView ArticlePubMedGoogle Scholar
- Swarbreck D, Wilks C, Lamesch P, Berardini TZ, et al: The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res. 2008, 36 (Database issue): D1009-D1014.PubMed CentralPubMedGoogle Scholar
- Bell GW, Yatskievych Ta, Antin PB: GEISHA, a whole-mount in situ hybridization gene expression screen in chicken embryos. Dev Dyn. 2004, 229 (3): 677-687. 10.1002/dvdy.10503.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.