Figure 14 | BMC Bioinformatics

Figure 14

From: Prediction of gene-phenotype associations in humans, mice, and plants using phenologs

Measuring the effect of additional datasets on predictive performance. Here, we used our best classifier (naïve Bayes with Pearson sample correlation as the distance function, weighted by hypergeometric CDF) and subtracted out datasets to determine their relative contributions. Unless otherwise indicated, classifiers were run with k=40. (a) demonstrates that, for the original species used by McGary et al. (also including the new phenotypes from Green et al.), the k nearest neighbors method performs substantially better than the original Phenologs method (approximated by k=1). The datasets are labeled mcgary (mouse, worm, nematode, yeast, and plant), green (nematode), Dr for zebrafish, Ec for E. coli, and Gg for chicken. The best-performing analysis was repeated (labeled “(1)” and “(2)”, with different random test genes withheld) to demonstrate that performance is robust under cross-validation. (b) tests whether specific phenotypes are more useful than broad phenotypes by breaking down the green dataset into its components, green-specific and green-broad. We found that including both green datasets yielded the best results at relevant ranks, but that both hurt results at less relevant ranks (beyond 45). Also shown is a comparison between the original datasets (mcgary alone) and the best-performing collection from (a), which includes all datasets except chicken (represented by the solid cyan line).
