Consequences of choosing a mutual information score threshold (for predicting functional linkages among proteins) based on fully shuffled protein profiles. Results are from phylogenetic profile comparison of 11,935 pairs of 155 bacterial-specific E. coli proteins. (a) Distribution of mutual information scores using reference set BAE4. The blue curve represents the distribution of scores for all 11,935 protein pairs (using actual profiles), while the green curve represents the distribution of scores for the 983 protein pairs (using actual profiles) with ≥50% pathway similarity. The dashed curves represent the distributions of scores for the 11,935 protein pairs using shuffled profiles. The dashed red curve corresponds to fully shuffled profiles, obtained by shuffling all of the entries in the actual protein profiles. This type of shuffling implicitly assumes that each protein under study is present in all lineages/kingdoms, an assumption that is incorrect for lineage-specific proteins. The dashed blue curve corresponds to restrictively shuffled profiles, obtained by shuffling only the profile entries corresponding to bacterial genomes. (b) Relationship between the mutual information threshold and the p-value (the probability that the score for a pair of proteins meets or exceeds the chosen mutual information threshold). The statistical significance of a given similarity score for a pair of proteins (using actual profiles) could be overestimated if fully shuffled profiles (dashed red curve), rather than restrictively shuffled profiles (dashed blue curve), were used to model the behavior of unrelated pairs of proteins. (c) Relationship between the mutual information threshold (for predicting positives and negatives) and the positive predictive value (prediction accuracy).
This plot illustrates that the commonly used approach of choosing a threshold based on the distribution of scores from fully shuffled profiles (0.6, based on the dashed red curve in (a)) may lead to a substantial fraction of predictions being false positives. In contrast, a cutoff chosen based on the distribution of scores from restrictively shuffled profiles (1.1, based on the dashed blue curve in (a)) more than doubles the prediction accuracy, albeit at the cost of reduced coverage.
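The difference between the two shuffling schemes, the empirical p-value in (b), and the positive predictive value in (c) can be illustrated with a minimal Python sketch. It is not the implementation used to produce this figure. It assumes binary presence/absence profiles and mutual information measured in bits, which may differ from the profile encoding used here. All function names, the `is_bacterial` mask, and the toy reference set of 20 bacterial and 10 non-bacterial genomes are hypothetical.

```python
import math
import random


def mutual_information(a, b):
    """Mutual information (bits) between two equal-length binary profiles."""
    n = len(a)
    mi = 0.0
    for x in (0, 1):
        px = sum(1 for i in a if i == x) / n
        for y in (0, 1):
            py = sum(1 for j in b if j == y) / n
            pxy = sum(1 for i, j in zip(a, b) if i == x and j == y) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi


def shuffle_fully(profile, rng):
    """Permute every entry of the profile, across all genomes."""
    shuffled = list(profile)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_restricted(profile, is_bacterial, rng):
    """Permute only the entries at bacterial genomes; leave the rest fixed."""
    idx = [i for i, flag in enumerate(is_bacterial) if flag]
    values = [profile[i] for i in idx]
    rng.shuffle(values)
    shuffled = list(profile)
    for i, v in zip(idx, values):
        shuffled[i] = v
    return shuffled


def empirical_p_value(null_scores, threshold):
    """Fraction of null (shuffled-profile) scores at or above the threshold."""
    return sum(1 for s in null_scores if s >= threshold) / len(null_scores)


def positive_predictive_value(scores, is_true_link, threshold):
    """Fraction of pairs scoring at or above the threshold that are true links.

    Returns None when no pair reaches the threshold.
    """
    predicted = [t for s, t in zip(scores, is_true_link) if s >= threshold]
    if not predicted:
        return None
    return sum(1 for t in predicted if t) / len(predicted)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical reference set: 20 bacterial genomes, then 10 others.
    is_bacterial = [True] * 20 + [False] * 10
    # Two toy bacterial-specific proteins, absent from all non-bacterial genomes.
    prot_a = [1] * 10 + [0] * 10 + [0] * 10
    prot_b = [0] * 5 + [1] * 10 + [0] * 5 + [0] * 10

    null_full = [
        mutual_information(shuffle_fully(prot_a, rng), shuffle_fully(prot_b, rng))
        for _ in range(1000)
    ]
    null_restricted = [
        mutual_information(
            shuffle_restricted(prot_a, is_bacterial, rng),
            shuffle_restricted(prot_b, is_bacterial, rng),
        )
        for _ in range(1000)
    ]
    threshold = 0.2
    print("p (fully shuffled):       ", empirical_p_value(null_full, threshold))
    print("p (restrictively shuffled):", empirical_p_value(null_restricted, threshold))
```

In the restricted scheme, the shared absence of both proteins from every non-bacterial genome is preserved in each shuffled pair, so the null distribution retains the background similarity that unrelated lineage-specific proteins have simply by belonging to the same lineage.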