Figure 14
From: Prediction of gene-phenotype associations in humans, mice, and plants using phenologs

Measuring the effect of additional datasets on predictive performance. Here, we used our best classifier (naïve Bayes with Pearson sample correlation as the distance function, weighted by hypergeometric CDF) and subtracted out datasets to determine their relative contributions. Unless otherwise indicated, classifiers were run with k=40. (a) demonstrates that, for the original species used by McGary et al. (also including the new phenotypes from Green et al.), the k nearest neighbors method performs substantially better than the original Phenologs method (approximated by k=1). The datasets are labeled mcgary (mouse, worm, nematode, yeast, and plant), green (nematode), Dr for zebrafish, Ec for E. coli, and Gg for chicken. The best-performing analysis was repeated (labeled “(1)” and “(2)”, with different random test genes withheld) to demonstrate that performance is robust under cross-validation. (b) presents a test of whether specific phenotypes are more useful than broad phenotypes, by breaking down the green dataset into its components, green-specific and green-broad. We found that including both green datasets yielded the best results at relevant ranks, but that both hurt results at less relevant ranks (beyond rank 45). Also shown is a comparison between the original datasets (mcgary alone) and the best-performing collection from (a), which includes all datasets except chicken (represented by the solid cyan line).
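
The classifier named in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each phenotype is a binary vector over a shared set of orthologous genes, that the k phenotypes most positively correlated with the query (by Pearson correlation) serve as neighbors, that each neighbor is weighted by the hypergeometric CDF of its gene overlap with the query, and that weights are combined as independent evidence. The function names `pearson`, `hypergeom_cdf`, and `rank_candidate_genes`, as well as the combination rule, are hypothetical choices made for illustration.

```python
from math import comb, sqrt

def pearson(x, y):
    """Pearson sample correlation between two equal-length numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0 here
    return cov / sqrt(vx * vy)

def hypergeom_cdf(overlap, total, size_a, size_b):
    """P(X <= overlap) when drawing size_b genes from total, size_a of them marked."""
    ways = sum(comb(size_a, i) * comb(total - size_a, size_b - i)
               for i in range(overlap + 1))
    return ways / comb(total, size_b)

def rank_candidate_genes(query, phenotypes, k=40):
    """Rank genes not yet in the query phenotype (assumed scoring scheme)."""
    total = len(query)
    # Keep only positively correlated phenotypes, most similar first.
    neighbors = sorted(
        ((pearson(query, vec), name) for name, vec in phenotypes.items()),
        reverse=True,
    )
    neighbors = [(r, name) for r, name in neighbors if r > 0][:k]

    miss = [1.0] * total  # probability that no neighbor supports the gene
    for _, name in neighbors:
        vec = phenotypes[name]
        overlap = sum(q * v for q, v in zip(query, vec))
        weight = hypergeom_cdf(overlap, total, sum(query), sum(vec))
        for gene, present in enumerate(vec):
            if present:
                miss[gene] *= 1.0 - weight  # assumed independence of neighbors

    scores = [(g, 1.0 - miss[g]) for g in range(total) if not query[g]]
    return sorted(scores, key=lambda gs: (-gs[1], gs[0]))

# Toy data: six orthologous genes, three phenotypes from other species.
query = [1, 1, 0, 0, 0, 0]
phenotypes = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [0, 0, 0, 1, 1, 0],
    "C": [1, 0, 0, 0, 0, 1],
}
ranked = rank_candidate_genes(query, phenotypes, k=2)
single = rank_candidate_genes(query, phenotypes, k=1)
```

In this toy setup, setting `k=1` uses only the single most similar phenotype, which mirrors how the caption approximates the original Phenologs method, while larger `k` lets additional neighbors contribute candidate genes.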