A matrix-based formalism for comparing gene-phenotype associations between species
The phenolog approach, developed by McGary et al. in [9], identifies pairs of homologous phenotypes in different organisms by counting the overlap between the sets of genes associated with them. McGary et al. hypothesize that pairs of phenotypes with a greater than expected number of shared genes derive from a shared evolutionary past, and further hypothesize that genes associated with one might therefore be good candidates for the other. To extend this conceptual framework to make predictions based on multiple phenotypes from multiple species, we developed a matrix-based formalism for integrating phenotypic information.
For a given species, the set of all gene-phenotype associations can be thought of as a matrix where rows correspond to genes and columns to phenotypes, and the matrix has a 1 in position i,j if the i-th gene has been observed to be associated with the j-th phenotype. This formalism works well when all the genes and phenotypes studied are from the same organism, but leads to some complications when extended to pairs or groups of species.
In particular, since generating these gene-phenotype matrices involves translation via gene orthology, we investigated whether expansions and contractions of gene families (e.g., in Arabidopsis, which frequently has large paralogous gene expansions relative to other eukaryotes) might produce enough noise to obscure signal from other gene-phenotype associations of interest.
In order to address this question, we developed two different frameworks within our matrix formalism for translating gene-phenotype associations between species (Figure 2). In the first, the “gene-based” approach, we let the rows correspond to genes in the species that we wish to make predictions for, and translated the gene-phenotype interactions from a number of species by orthology. This gave us a number of species-specific gene-phenotype association matrices ΦS, for S∈{human, mouse, yeast, nematode, plant, zebrafish, fly, chicken }, where each ΦS is defined by
We used the INPARANOID algorithm [11] to determine which genes in different organisms are orthologs of each other. The INPARANOID algorithm discovers orthology relationships in the form of orthogroups (Figure 2A).
For the method described above, we simply translated other species’ gene-phenotype associations into the target (e.g., into human genes when predicting human gene-disease associations) gene-phenotype matrix by orthogroup, and compared the phenotype columns in terms of human genes (as in Figure 2B-C).
This gene-based approach works very well for closely-related species, where genes often have one-to-one equivalents between species. However, when large orthogroups are involved, the predictive performance of this approach deteriorates.
To mitigate the decrease in performance caused by paralogous gene expansions we devised an “orthogroup-based” matrix framework, in which rows corresponded to INPARANOID orthogroups (Figure 2A and D) rather than actual genes (Figure 2B-C),
Notably, the use of orthogroups can dramatically simplify the relationships. Consider orthogroup O
A
(Figure 2A): one gene from each species is involved in the phenolog; whereas in Figure 2B, a matrix with each row representing a human gene, a single mouse gene-phenotype association translates into three human gene associations because of the paralogous expansion of this gene family in humans. Similarly, in Figure 2C, in which each row is a mouse gene, a single human gene-phenotype association translates into two mouse gene associations, due to a mouse-specific paralogous expansion. In contrast, the orthogroup-based matrix in Figure 2D permits a symmetric comparison of the phenotypes, reducing paralogs in each species to a single orthogroup. Furthermore, a hypergeometric CDF test of the intersection between phenotypes ϕ
h
and ϕ
m
will produce different values for the matrices described in Figures 2B-C. The consequence of asymmetric distances is that ϕ
h
may have ϕ
m
as its closest neighbor when the search is performed in one direction, but ϕ
m
may not have ϕ
h
as its closest neighbor in the reverse search.
The orthogroup-based matrix (Figure 2D) has the advantage of producing consistent, symmetric similarity scores regardless of the direction of prediction; furthermore, these scores are not inflated by the co-occurrence of multiple phenotype observations in a single orthogroup. Unless otherwise noted, we use this framework for the analyses that follow.
Integrating information from multiple phenologs
Given this basic formalism — a matrix of gene-disease associations incorporating phenotypic data from multiple species — we next evaluated methods for ranking genes on the basis of their tendency to be involved in a phenotype of interest. In other words, we wanted to construct a set of predictions X for gene-phenotype associations such that X
i
j
is higher for pairs where the gene i is actually associated with the phenotype j.
One way to incorporate information from multiple phenotypes is by measuring the similarity — in terms of associated genes — between pairs of phenotypes, and integrating the information from different phenotypes in such a way such that more similar phenotypes get more weight than less similar phenotypes. We tested two different ways of integrating this information — one multiplicative naïve Bayes scheme, and one additive method.
The naïve Bayes scheme we used was first described in the original phenolog paper [9], and can be written as follows:
(1)
where
(2)
(3)
We tested a wide variety of measures for the weighting function w
j
l
that calculates a similarity or distance between two sets. Pearson sample correlation is a particularly popular option for expert recommendation systems, such as those used in online retail for generating recommendations from past purchase history. McGary et al. used the hypergeometric CDF, which gives the probability of seeing an intersection of size v or greater between phenotypes containing m and n genetic elements, with N total elements in the species pair (Figure 1A-B).
For f
i
j
l
we used v/n, the fraction of the number of genes common to both phenotypes j and l over the number of genes known to be involved in phenotype j, which empirically appears to be a good approximation of the probability that a candidate gene from a single phenolog will turn out to be a true positive [9].
While the naïve Bayes method multiplies distances or similarities as if they were probabilities, for the additive method, X
i
j
is calculated for each gene-phenotype pair (i,j) by taking the sum over all nearest neighbor phenotypes k, weighted by the similarity between phenotypes j and k, so
(4)
where Φ is a phenotype matrix and w is a weight matrix of phenotype-phenotype similarity scores.
In addition to the hypergeometric CDF and Pearson sample correlation, we tested Euclidean distance, taxicab (Manhattan) distance, cosine distance and Tanimoto coefficient as measures of phenotypic similarity, both for finding the k nearest neighbors and as weighting functions. We expected that orthologous phenotypes from closely related species might show more similar gene sets than those from more distantly related species. In turn, various distance functions might account for this bias to a greater or lesser extent; we thus compared different distance functions using cross-validation, as noted later. Euclidean and Manhattan distance performed extremely poorly in the gene framework, using five-fold cross-validation, so we excluded them from analyses in the orthogroup framework. Overall, the Pearson coefficient and hypergeometric test appear to have the most power for identifying nearby predictive phenologs (Figure 3A).
We also repeated the analysis while varying the distance function (used for searching) and holding the recommendation function (w) the same, and vice-versa (Figure 3B). Pearson sample correlation showed the best performance among the distance functions; however, we found that the hypergeometric CDF was the best weighting function for assigning prediction scores to genes.
We compared the naïve Bayes and additive classifiers, with the results shown in Figure 4. The performance in cross-validation is quite similar between the two classifiers, with the best version of the naïve Bayes classifier (using Pearson sample correlation for distance and hypergeometric CDF for weighting) performing slightly better than the best additive one (using Pearson sample correlation and hypergeometric CDF). However, the additive classifier allows us to visualize and deconstruct the predictions into component phenotypes. We therefore chose to use the additive classifier for most predictions.
Varying the maximum number of neighbors (k) tends to affect lower-ordered predictions (e.g., the thousandth gene predicted for a disease) to a larger extent than top predictions. Figure 5 shows that even including the k=5 nearest neighbors improves the results modestly — raising the number of diseases for which the withheld genes can be predicted at a top-100 median rank from around 50 to 80. Searching for the k=40 nearest neighbors seems to offer no meaningful improvement over k=10 at relevant ranks. Thus, while a higher value of k may not always provide the best predictor, it is more likely, on average, to be useful than only considering the single closest phenotype.
Some phenotypes were intrinsically unpredictable; notably, several of these were revealed to be combinations of unrelated diseases that were overcollapsed into the same entity in the initial version of our OMIM database (two such examples were achromatopsia with achondroplasia, and the combination of all blood type genes), thus serving inadvertantly as negative controls.
The best similarity functions produced highly correlated predictions. Further, the best predictions of the worst classifiers were highly correlated with the best predictions of the top-performing classifiers. We thus concluded that the potential benefits of a fusion or blending classifier, a model that draws the best characteristics from simple classifiers via optimization, would be modest at best. At worst, any improvement would be difficult to measure; optimization of such a blending model would require an additional layer of cross-validation, and many phenotypes would need to be dropped due to relatively small associated gene sets.
While similarity functions produced remarkably similar results, predictions coming from different species were much less strongly correlated (Figure 6), suggesting that weighting phenotypes by species in some manner may offer additional improvement. While each species provides uncorrelated prediction information, the human disease predictions are dominated by mouse whenever that species is included — likely because of the highly correlated nature of the exploration of gene-phenotype associations in mouse and human.
Finally, we measured the extent to which our predictive performance was improved compared to random trials. To measure this enrichment, we generated a series of random gene-based matrices. For each phenotype-column of cardinality p, we marked p randomly-drawn genes as observed. We attempted to predict phenotypes-of-interest from these randomized matrices using our regular classifiers (Figure 7). (Note that we did not repeat the randomized matrix control for the orthogroup-based matrices, primarily because randomization of gene-phenotype associations eliminates the type of structure which made orthogroup-based matrices necessary.)
Importantly, we see a strong improvement in predictive performance on actual gene-phenotype associations as compared to randomized data. The method is able to recover all known genes for several real diseases — but is unable to recover withheld genes for any of the randomized diseases.
Epilepsy
In addition to evaluating our method’s overall performance, we wished to take a closer look at its prediction of individual diseases in our database. We chose epilepsy because, despite offering ostensibly correct predictions, it actually scores somewhat poorly in cross-validation. In our initial three-fold leave-one-out test, only one of the three separately withheld genes was recovered at a reasonably testable rank (twelve, in this case).
Our method successfully identifies GABBR1, GABBR2[12], and KCNA1[13], which were absent from our database but known to be associated with the disorder. These were predicted primarily due to mouse phenotypes that resemble epilepsy (clonic seizures and abnormal brain wave pattern; Figures 8 and 9).
Top epilepsy predictions include PAX6, PRRX1, and RAX2 (of which PAX6 has been associated with seizures); and PAX3, PAX7, HESX1, and NKX2-1, NKX2-4, NKX2-6, and NKX2-8 (Figure 9). Notably, NKX2-1 is involved in mouse epilepsy [14], and PAX3 appears in a region linked to the human version of the disease [15]); neither of these genes were in our database.
Interestingly, these predictions come from the Arabidopsis phenotypes regulation of gene expression by genetic imprinting, cotyledon development, epidermal cell differentiation, and gene silencing by RNA, as well as the yeast phenotype annotation for sensitivity to trichlormethine (nitrogen mustard, or tris(2-chloroethyl)amine).
To learn more about the general predictability of the epilepsy phenotype, we ran an expanded cross-validation, withholding each of the full set of 51 epilepsy genes in our database, and found that six genes could be predicted back — all within the top 120 ranks. We note that even when a phenotype performs poorly in cross-validation, it seems that our method still provides useful predictions.
We wanted to know the extent to which predictions could be attributed to paralogy (shared orthogroup membership) with genes already associated in our database with epilepsy. GABBR1 and GABBR2 are each singleton orthogroup members, and are thus independently predicted. KCNA1 and KCNA2 emerged as paralogs following the human-worm divergence, but are predicted from non-worm phenotypes — and are therefore also independent predictions.
PAX6’s plant-human paralogs make up the top three rank bins in Figure 9. We suggest that even non-independent predictions are of use, provided they are accompanied by independent predictions — since, as mentioned, PAX3, NKX2-1, and PAX6 are all associated to some degree with seizures and/or epilepsy. Indeed, the inclusion of species in which these genes are not paralogs offers additional resolution on predictions and demonstrates the utility of our method.
Predicting from E. coli— Pharmacologically-induced seizures
We then turned to a similar mouse phenotype, pharmacologically-induced seizures, to determine whether E. coli gene-phenotype associations could be used to make predictions about mammalian associations without additional information. We found that mouse genes linked to pharmacologically-induced seizures could be predicted extraordinarily well from E. coli alone in cross-validated tests: eight of the 48 genes associated with this mouse phenotype could be predicted back when withheld. These results are particularly impressive because they represent all six of the mouse-E. coli orthogroups associated with this seizure phenotype. Two of the orthogroups (Grik2/Grik5 and Slc1a2/Slc1a3) are in the top prediction ranking bin; additionally, Faim2 is in the top hundred ranks (Figure 10).
Next, we examined the predictions for promising new candidate genes. One of the most intriguing candidates was α-adducin, which is known to be reduced in the brains of rats experiencing kainate-induced seizures [16].
Another interesting prediction is Sv2a (synaptic vesicle glycoprotein). It was recently reported that a mutation in chicken SV2A leads to photosensitive reflex epilepsy [17]. Mouse Sv2a is a known binding site for levetiracetam, an antiepileptic drug [18], and Sv2a−/− mice experience seizures and die within three weeks of birth [19, 20].
We also examined the compounds associated with the source E. coli phenotypes to see if these were associated with seizures. The compounds included ethanol — alcohol poisoning and alcohol withdrawal symptoms include seizures — as well as paraquat, which causes seizures and brain damage in rats [21]; and aztreonam, which is a convulsant [22]. While many compounds might cause seizures if given in sufficient amounts, a control PubMed search for ten randomly chosen compounds associated with E. coli phenotypes in our database failed to turn up such clear associations.
Thus, both at the level of predicting candidate genes and affiliated compounds, the E. coli phenotypes appear to be relevant. Both of the genes discussed above may indeed represent reasonable new candidates for affecting pharmacologically-induced seizures and might warrant follow-up experiments. Finally, it is particularly striking that mammalian seizures — which are distinctly neurological phenomena — could be derived from processes so fundamental as to exist even in bacteria.
Atrial fibrillation
We looked next at the human heart phenotype atrial fibrillation (AF), expecting to find that AF, like pharmacologically-induced seizures, was rooted in highly-conserved signaling defects. Instead, we found that the most predictive phenotypes were from mouse and chicken — quite unlike epilepsy, for which plants, worms, and yeast offered a great deal more information than mouse or chicken.
The AF phenotype performed well cross-validation in the gene-based configuration method. However, in the orthogroup-based cross-validation, only three of the eight genes associated could be predicted after being withheld. The removed genes were predicted at ranks 3-4, 15-16, and 81-94. Nonetheless, the novel predictions for this phenotype are worth noting.
The top-ranked new prediction for atrial fibrillation (AF) is the histamine receptor H2 (HRH2), largely contributed by gastrointestinal phenologs in mouse (Figure 11). Histamine has been known to act on heart cadence for over a hundred years [23]. However, an empirical link between heart and gastrointestinal function was established by the recent observation that histamine increases the heart rate in pythons during digestion [24] — regulation which both occurs via the H2 receptor [25-27] and is apparently ubiquitous in vertebrates.
Similarly predicted are ATP4A and, further down the list, ATP4B, which are the α and β subunits of the H+/K+ ATPase. This proton pump is responsible for gastric acid secretion during digestion.
A somewhat speculative connection is offered by recent work, which showed cigarette smoke extracts cause an increase in the amount of H+/K+ ATPase in the stomach [28]. It is unclear — and worth testing — whether ATP4A and ATP4B are expressed in the heart. These genes could offer an additional route by which smoking contributes to heart problems.
Following HRH2 and ATP4A is HOPX, or homeodomain only protein x, which is down-regulated during heart failure in humans [29]. It is not clear that HOPX is involved in AF per se, but again worth exploring in future experiments, as is the gene ranked next, KCNE1, based on orthologous phenotype prolonged QT interval — and seemingly also a factor in rare cases of atrial fibrillation [30-32].
GJA1 (gap junction protein, α1, also known as connexin 43 or Cx43) is one of the two most abundantly expressed connexins in the heart [33-35]. The other is GJA5 (connexin 40), already associated in our database with AF. Cx40 and Cx43 seem to form heteromeric channels with different properties from homomeric channels [36]. Cx43, unlike Cx40, is essential for heart development and cardiac impulse conductance in mice [37]. Tuomi et al. observed that a dominant negative Cx43 mutant causes severe AF [38]. Finally, atrial fibrillation was observed in a somatic mutation in human GJA1[39]. Notably, another channel similarly implicated was SCN5A (human cardiac sodium channel, voltage-gated, type V, α subunit). This sodium channel component has been associated with atrial fibrillation [40-42] but was missing from our database.
In terms of the predictor phenotypes themselves, the top AF phenologs can be grouped into three basic categories: cardiac, gastric, and auditory. We have explored the first two categories, but have not considered genes from the third. We note that while Jervell and Lange-Nielsen syndrome (i.e., long QT syndrome) has been associated with deafness for half a century [43-46] via alleles of KCNQ1[47] and KCNE1[48], other genes may yet be involved [49]. Further, Belmont et al. write of “a growing appreciation for conditions that affect hearing and which are accompanied by significant cardiovascular disorders” [50].
Given the success with which our method was able to predict AF genes — and with which it was able to identify potentially related disorders — exploration of additional candidates (e.g., ATP4A/B, POU4F3, and S1PR2) from Figure 11 may be warranted.
Plant phenotypes — response to vernalization
Finally, having focused primarily on predicting mammalian phenotype- and disease-genes, we asked whether plant gene-phenotype associations could be predicted from the other species in our database.
Plants represented a particular challenge, since a number of factors reduce the specificity of predictions for plant phenotypes. Firstly, while human phenotypes are predicted at least in part from other mammals and even other vertebrates — which are phylogenetically similar — there are no close neighbor species to Arabidopsis in our database.
Secondly, while 19,439 of the 28,002 human genes in our database have orthologs in other species, the ratio is less promising for A. thaliana phenolog predictions: 12,668 of 27,325 have orthologs. The cause is likely again the lack of other plants in our database, compared to the several vertebrates from which to draw information for H. sapiens.
Third and finally, the Arabidopsis genome contains a great deal of redundancy, as observed in [51]: 37.4% of proteins belong to families of more than five members, compared to 12.1% in fruit fly and 24.0% in worm. In predictions that rely on gene orthology, as with phenologs, there is often no way to distinguish which of the plant paralogs is most relevant — except perhaps by relying on paralogous phenotypes.
Our orthogroup-based matrix formalism (Figure 2D) was thus critical for addressing the extensive divergence of gene families between distant species. In particular and as previously described, when attempting to predict Arabidopsis phenotypes, we noticed that the gene-based formalism resulted in asymmetric scores and unwarranted improvements in rank of certain predictions, particularly those where large orthogroups were involved (Figure 2A-C). The genes-as-rows configuration also inflated performance, as measured by ROC plots, during cross-validation — primarily due to the high frequency with which plant gene expansions co-participate in a biological process.
Given this formalism, we then determined phenotypes which may be predictable by cross-validating predictions produced from all non-plant species in the database (Figure 12), and found a large number (greater than 50) of the plant phenotypes to be reasonably well-predicted based on non-plant phenotypes. We describe final predictions for the response to vernalization phenotype (Figure 13).
We selected this phenotype because it scores better than most other plant phenotypes in cross-validation; seven of the fifteen genes in this plant phenotype can be predicted back at low rank when withheld, representing two or three orthogroups (about half of the total number of orthogroups) depending upon the source species considered.
Although we cannot easily cross-validate predictions from paralogous phenotypes, since they are not sufficiently independent, we speculate that the inclusion of paralogous phenotype data may help to improve the specificity of the predictions at no perceivable cost.
Among the new candidate genes implicated, two are particularly notable. One of these, EMF2 — which appears to be associated with vernalization-mediated flowering by its interaction with CLF[52] — is predicted based on seemingly unrelated orthologous mouse and human phenotypes (abnormal chorion morphology and endometrial cancer, respectively). EMF2 is paralogous with known vernalization gene VRN2; however, it is ranked ahead of VRN2 by its association with the related plant phenotype negative regulation of flower development. That EMF2 was boosted by a potential paralogous phenotype supports the hypothesis that paralogous phenotypes are similarly useful to orthologous phenotypes in predicting gene function.
The second interesting prediction is FWA and its several paralogs (HDG1-4, HDG7-12, PDF2, ANL2, ATML1, AT5G07260, and HB-7). Certain FWA mutants produce a vernalization-insensitivity phenotype [53, 54]. Candidates ANL2 and PDF2 both have late flowering phenotypes [55, 56] markedly similar to that of FWA[57]. That discovery lends additional support for paralogous phenotypes, as neither FWA nor PDF2 were associated in our database with regulation of flower development — but our method successfully identified negative regulation of flower development as a potential phenolog.
Empirically, we believe that the strength of our predictions for plant phenotypes are limited primarily by the quantity of gene-phenotype information available for plants. Notably, the addition of associations from other plant species would prove exceptionally useful for predicting not only Arabidopsis phenotypes, but crop species as well, and — as demonstrated earlier — even animal phenotypes.
Datasets
We sought to determine whether the improvement in our method over the original Phenologs method could be attributed in part to the additional species information or arose exclusively from incorporating phenotypes beyond the nearest neighbor (as demonstrated in Figure 5).
We began by plotting the performance of those predictions drawn from the species used by McGary et al. (mouse, nematode, yeast, and plant, short-hand mcgary+green, notably including the additional phenotypes from Green et al.). As expected, we found that increasing k from 0 to 40 markedly improved the results (Figure 14).
Next, we tried adding the new species (chicken, E. coli, and zebrafish) individually, and found improvements at relevant ranks for E. coli and D. rerio. Surprisingly, inclusion of chicken data hurt the predictive performance. We also tried adding E. coli and D. rerio at the same time, and saw an additional increase in performance.
Given the decrease in performance resulting from the inclusion of chicken phenotypes, we sought to determine whether the additional C. elegans datasets were negatively affecting performance. In Figure 14B, we tried leaving out the broad and specific components of the green dataset. We found that either component alone performed worse than the combination. We also tried leaving out the green datasets altogether, and found that their inclusion moderately improved performance at relevant ranks, but decreased performance beyond around rank 45.
These mixed results with the datasets are somewhat surprising. We expected that the in situ hybridization expression annotations from GEISHA would be useful for human predictions not only because gene expression stage and location should correlate highly with phenotype, but also because chicken — like mouse — is more closely related to human compared to other species in our database.
The distributions of genes per phenotype for human, mouse, Arabidopsis, and chicken were similar (not pictured). Only E. coli differed substantially, with most phenotypes involving between 800 and 1,000 genes. However, in general, the counts of E. coli-human orthologs involved in bacterial phenotypes are much smaller due to the relatively small fraction of genes with orthologs between the two species.