 Methodology Article
 Open Access
A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support
BMC Bioinformatics volume 18, Article number: 361 (2017)
Abstract
Background
Probabilistic assessments of clinical care are essential for quality care. Yet machine learning, which supports this care process, has been limited to categorical results. To maximize its usefulness, it is important to find novel approaches that calibrate the ML output with a likelihood scale. Current state-of-the-art calibration methods are generally accurate and applicable to many ML models, but improved granularity and accuracy of such methods would increase the information available for clinical decision making.
This novel nonparametric Bayesian approach is demonstrated on a variety of data sets, including simulated classifier outputs, biomedical data sets from the University of California, Irvine (UCI) Machine Learning Repository, and a clinical data set built to determine suicide risk from the language of emergency department patients.
Results
The method is first demonstrated on support-vector machine (SVM) models, which generally produce well-behaved, well-understood scores. The method produces calibrations that are comparable to the state-of-the-art Bayesian Binning in Quantiles (BBQ) method when the SVM models are able to effectively separate cases and controls. However, as the SVM models' ability to discriminate classes decreases, our approach yields more granular and dynamic calibrated probabilities compared to the BBQ method. Improvements in granularity and range are even more dramatic when the discrimination between the classes is artificially degraded by replacing the SVM model with an ad hoc k-means classifier.
Conclusions
The method allows both clinicians and patients to have a more nuanced view of the output of an ML model, allowing better decision making. The method is demonstrated on simulated data, various biomedical data sets and a clinical data set, to which diverse ML methods are applied. Trivially extending the method to (non-ML) clinical scores is also discussed.
Background
Clinical decision support systems can be defined as any software designed to directly aid in clinical decision making in which characteristics of individual patients are matched to a computerized knowledge base for the purpose of generating patient-specific assessments or recommendations that are then presented to clinicians for consideration [1, 2]. They are important in the practice of medicine because they can improve practitioner performance [1, 3,4,5], clinical management [6, 7], drug dosing and medication error rates [8,9,10], and preventive care [1, 11,12,13,14,15,16].
Machine learning (ML) gives computers the ability to learn from, and make predictions on, data without being explicitly programmed regarding the characteristics of that data [17]. It should not be surprising, then, that ML pervades clinical decision support, for two reasons. First, clinical decision support systems are structured such that patients are represented as features which can be used to map them to categories [18]. Second, healthcare data are complex: they can be distributed, structured, unstructured, incomplete, and not always generalizable.
Although logistic regression is widely used in biomedicine and it is highly recommended over ML approaches, ML algorithms have been used in many modern clinical decision support systems, ranging from predicting the incidence of psychological distress in Alzheimer's Disease [19] to post-cardiac-arrest neuroprognostication [20]. A Google Scholar search of "machine learning biomedical" renders over 385,000 results.
However, there is a problem when ML algorithms are used for clinical decision support. The output of an ML model is usually a real number that is thresholded to produce a binary output. This outcome appears to come from a "black box", a system module whose functioning is opaque. Yet caregivers and patients prefer probabilistic statements [21,22,23,24,25,26,27], and the "black box" approach runs counter to the goal of improving the decision-making power of physicians by providing more, not less, information to make better decisions [28]. In other words, "this patient has a 51% chance of developing heart disease" is more informative than a binary output of: "an ML algorithm has indicated that this patient belongs to a group of patients that develops heart disease."
The effect of expressing clinical results probabilistically has been studied for decades. As early as 1977, Shapiro [29] introduced a method for assessing the predictive skills of physicians versus the results of "computerized procedures" that had been designed to provide probabilistic predictions of various clinical outcomes. Hopkins [30] suggested optimal plain-language descriptions of probabilities in a clinical setting. Grimes and Schulz [31] found that combining an accurate clinical diagnosis with likelihood ratios from ancillary tests improved diagnostic accuracy in a synergistic manner. Along these lines, Wells et al. [32] and Kanis et al. [33] provided specific examples of how probabilistic assessments of proximal deep vein thrombosis and bone fracture risk, respectively, could improve clinical outcomes.
Presenting results in probabilistic terms is as important to patients as it is to clinicians. Doctors using the probabilistic decision-making process will give information to patients about risks and benefits, often in numerical terms [34, 35]. Trevena et al. [36] found that patients have a more accurate understanding of risk if probabilistic information is presented as numbers rather than words, even though some may prefer receiving words.
The goal of this article is then to ensure that both patient and clinician can gain as much information as possible, and in the most straightforward way possible, from the output of an arbitrary ML algorithm by effectively converting ML-generated outputs to probabilities. The assumption here is that the clinician is uninterested in a simple cutoff, but wants to gain an intuitive sense of the degree to which the ML classifier "believes" that a datum belongs to one class or another. But for those who desire a threshold, the calibration is all the more important, since the rational choice of one class over the other is determined by whether the class probability is greater or less than 0.5.
There are three common methods used to calibrate ML outputs to probabilities today: Platt Scaling [37], Isotonic Regression [38], and Quantile Binning [39], which are discussed in turn.
Platt's method fits a logistic regression (LR) model to the ML scores from a training set, thereby providing an equation that directly transforms an ML-based classifier score to a probability. Although the LR model is not always appropriate and is prone to overfitting for small training sets, it can provide good calibration in certain circumstances (e.g., when Support Vector Machines are used as classifiers).
In an attempt to improve upon Platt's method, the isotonic regression (IR) approach relaxes the linearity assumptions in the LR model, fitting a piecewise constant nondecreasing function to the sorted ML scores in the training set. Although this calibration can yield good results, the isotonicity assumption is not always valid. In fact, Niculescu-Mizil and Caruana [40] demonstrated, using multiple classifiers and data samples of varying size, that both the Platt and IR methods can produce biased probability predictions.
Quantile Binning, on the other hand, mitigates the assumptions in the Platt and IR approaches by sorting the ML scores from a training set, and partitioning them into subsets (bins) of equal size. A new ML score can be simply transformed to a probability by locating its corresponding bin, and then calculating the fraction of positive outcomes in this bin from the training set [39]. While less restrictive than the other approaches, the drawbacks of this method include the fact that the number of bins must be set a priori, and that small training sets can corrupt the calibration. The Bayesian Binning in Quantiles (BBQ) method mitigates these limitations by effectively averaging over many binning schemes, which leads to a better overall calibration [41].
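The quantile binning step described above can be sketched in a few lines. This is only a minimal illustration, not the implementation used in [39] or [41]: the function names `fit_quantile_bins` and `calibrate`, the default of five bins, and the choice to clamp out-of-range scores to the nearest bin are all assumptions made for the example.

```python
from bisect import bisect_left

def fit_quantile_bins(scores, labels, n_bins=5):
    """Sort training scores, cut them into (near-)equal-count bins, and
    return each bin's upper score edge and its fraction of positives."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    edges, fractions = [], []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:  # more bins than points: skip empty bins
            continue
        edges.append(chunk[-1][0])
        fractions.append(sum(lab for _, lab in chunk) / len(chunk))
    return edges, fractions

def calibrate(score, edges, fractions):
    """Look up the bin holding `score` (clamped to the last bin, an
    assumption) and return that bin's positive fraction."""
    idx = min(bisect_left(edges, score), len(fractions) - 1)
    return fractions[idx]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]  # 1 = positive outcome
edges, fractions = fit_quantile_bins(scores, labels, n_bins=5)
print(calibrate(0.55, edges, fractions))
```

With only five bins, at most five distinct probabilities can be returned, which illustrates the granularity limit that motivates averaging over binning schemes.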
While it is difficult to argue with the overall accuracy and generalizability of the BBQ method, the present work will demonstrate that the granularity and dynamic range of calibrated probabilities, and in some cases the calibration accuracies, can be substantially improved by applying a novel nonparametric Bayesian approach. As with the previous methods, this approach requires a training set. But rather than using it to build a mapping between ML outputs and probabilities, the distributions of ML output from the positive and negative classes are directly compared to the ML output in question, rendering a probability that the ML output is derived from the one distribution versus the other.
Since the ML output is compared to the ML outputs of the two classes, a nonparametric approach is required, as there is no obvious binning strategy. Although there are many nonparametric Bayesian methods for comparing two samples [42,43,44,45], nonparametric Bayesian methods for specifically quantifying the probability of distribution pairings (i.e., comparing the similarity of distribution A and B versus the similarity of A to C) are rare. Capitalizing on its power and simplicity, the Bayesian nonparametric two-sample comparison approach in Holmes et al. [46] is modified for this purpose. The improved calibration then arises from the nonparametric approach that effectively allows for an infinite number of binning schemes, and from naturally including statistical uncertainties due to finite training samples.
The methodology is tested on a variety of data sets that have been classified using two different ML techniques. It will be found that the method provides probability estimates with a high granularity within a broad range of calibrated probabilities. This is important for many clinical applications. For example, in risk assessment studies routinely performed by institutional review boards, government agencies, and medical organizations, it is crucial to be able to compute probabilities that are typically <1% [47,48,49,50]. Additionally, clinical literature abounds in examples where probabilities are expressed, or thresholds are determined, via plotting the logarithm of probabilities, to ensure interpretability at the extremes of the probability range [51,52,53].
Methods
In the proposed approach, a binary ML classifier with a non-discrete score is assumed. It is further assumed that a training set is available, from which distributions of independent scores can be generated for the two classes in the data set. These distributions can be obtained by evaluating the score of the classifier applied to left-out points during the leave-one-out (LOO) cross-validation procedure. To determine the probability that a new datum is derived from a certain class, the ML classifier is evaluated for that datum. Then, a nonparametric Bayesian hypothesis test is applied to calculate the probability that the datum is derived from the parent distribution of that class as opposed to the parent distribution of the other class.
Mathematical formalism
The (posterior) probability introduced above is calculated by modifying the formalism in Holmes et al. [46], which constructed a nonparametric Bayesian two-sample hypothesis test. In detail, consider the probability that a single value X_{p} is derived from the parent distribution that generated a series of values X_{1}, as opposed to the parent that generated values X_{2}. The objective is to calculate Pr(H_{1} | X_{p}, X_{1}, X_{2}), the posterior probability of the hypothesis H_{1} that X_{p} and X_{1} are derived from the same parent. The alternative hypothesis, H_{2}, is that X_{p} is derived from the parent of X_{2}. By Bayes' theorem, the probability of interest can then be expressed as
\[ \Pr(H_1 \mid X_p, X_1, X_2) = \frac{\Pr(X_p, X_1, X_2 \mid H_1)\,\Pr(H_1)}{\Pr(X_p, X_1, X_2 \mid H_1)\,\Pr(H_1) + \Pr(X_p, X_1, X_2 \mid H_2)\,\Pr(H_2)} \]
where Pr(X_{p}, X_{1}, X_{2} | H_{1}) is the likelihood of obtaining X_{p}, X_{1}, and X_{2} given that X_{p} and X_{1} are derived from the same parent distribution, and Pr(H_{1}) is the prior probability for the hypothesis H_{1}. The prior Pr(H_{1}) is simply a number, containing a priori estimates of the occurrences of observations from class 1. The likelihood Pr(X_{p}, X_{1}, X_{2} | H_{1}), on the other hand, is computed with the help of Polya trees [54].
Polya trees are a set Π of nested partitions in some space Θ. In this work, Θ is a one-dimensional space where the ML scores are plotted. The partitions are generated by setting upper and lower bounds for the ML score derived from the training set, and then halving the space in several consecutive steps. At the start of the procedure, there is only "level 1" partitioning, where the two bins contain the number of score values, N_{0} and N_{1}, that fall on each side of the partition. Each segment of the space is then halved again, producing a total of 4 bins for the "level 2" partitioning which contain the counts N_{00}, N_{01}, N_{10}, and N_{11}, and so on.
Figure 1 illustrates the partitioning and labeling of such counts in each bin. The q_{X}'s indicate the probability of a value falling into the right vs. left partition. For instance, q_{00} is the probability of one of the N_{00} counts contained in bin '00' falling into bin '000' vs. bin '001' at the next partitioning step.
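The recursive halving and bin labeling can be illustrated with a short sketch. The function name `polya_counts`, the convention that a value exactly on a midpoint goes to the right-hand bin, and the use of the empty string for the root are assumptions made for this example; they are not specified in the text.

```python
def polya_counts(values, lower, upper, depth):
    """Map each node label ('' is the root, then '0', '1', '00', ...) to
    [left_count, right_count]: how many values in that node's interval
    fall into its left and right halves."""
    counts = {}
    for v in values:
        lo, hi, label = lower, upper, ""
        for _ in range(depth):
            mid = (lo + hi) / 2.0
            node = counts.setdefault(label, [0, 0])
            if v < mid:            # left half -> append '0' to the label
                node[0] += 1
                hi, label = mid, label + "0"
            else:                  # right half -> append '1' to the label
                node[1] += 1
                lo, label = mid, label + "1"
    return counts

counts = polya_counts([0.1, 0.3, 0.6, 0.9], lower=0.0, upper=1.0, depth=2)
print(counts)  # {'': [2, 2], '0': [1, 1], '1': [1, 1]}
```

In the notation above, the root entry gives N_{0} = 2 and N_{1} = 2, while the entries for '0' and '1' give N_{00}, N_{01} and N_{10}, N_{11}, respectively.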
Pr(X_{p}, X_{1}, X_{2} | H_{1}) can then be constructed. Let us assume that the parent distribution for class 1 is described by some set of binomial parameters, Q. Likewise, suppose the parent distribution for class 2 is described by R, and P describes the parameters in the parent distribution of the "new" ML score. P is then equal to Q assuming hypothesis H_{1}, and to R assuming the alternative hypothesis H_{2}. X_{p}, X_{1} and X_{2} are realizations of P, Q, and R, respectively. Assume that, at the j^{th} partition, l_{j0}, m_{j0} and n_{j0} (l_{j1}, m_{j1} and n_{j1}) are the counts of values that fall on the left (right) side of the split in distributions X_{p}, X_{1} and X_{2}, respectively. The likelihood that q_{j0} (1 − q_{j0}) at the j^{th} partition is the same for distribution P and Q, but not R, is then:
where Γ is the gamma function, δ is the Dirac delta function, {α_{j0}, α_{j1}} are parameters defined following a procedure described later in this section, and \( \tilde{j}=\left\{\varnothing, 0,1,00,01,10,11,001,101,\dots \right\} \) (following the notation in Holmes et al. [46] and Fig. 1). Each p_{∗0}, q_{∗0} and r_{∗0} is independently drawn from Beta(α_{∗0}, α_{∗1}).
Note that the second set of brackets in Eq. 3 encompasses the prior section, which is composed of two components: Dirac delta functions that act to tie p and q together through p^{′}, and terms involving gamma functions, which are Dirichlet priors.
Because each partition is assumed to be independent:
Pr(X_{p}, X_{1}, X_{2} | H_{2}) takes a similar form. With these two likelihoods, then, the posterior probability Pr(H_{1} | X_{p}, X_{1}, X_{2}) can be calculated explicitly.
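For orientation, a closed form for these likelihoods can be written down using standard Beta-binomial conjugacy. The expression below is a sketch under that assumption, following the marginal likelihood structure of the two-sample test in Holmes et al. [46]; it is not a transcription of the article's own equations, and factors common to both hypotheses are omitted.

```latex
\[
\Pr(X_p, X_1, X_2 \mid H_1) \;\propto\; \prod_{j \in \tilde{j}}
\frac{B(\alpha_{j0} + l_{j0} + m_{j0},\; \alpha_{j1} + l_{j1} + m_{j1})}{B(\alpha_{j0}, \alpha_{j1})}
\cdot
\frac{B(\alpha_{j0} + n_{j0},\; \alpha_{j1} + n_{j1})}{B(\alpha_{j0}, \alpha_{j1})},
\qquad
B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}
\]
```

Under H_{2} the roles are exchanged: the counts l of the new score are pooled with the class 2 counts n instead of the class 1 counts m. At any partition that does not contain X_{p}, l_{j0} = l_{j1} = 0 and the two likelihoods share the same factor, so only the partitions along the path of X_{p} affect the posterior.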
There are several practical considerations to keep in mind while calculating the posterior above. One is that the definition for α_{X} is adopted from Holmes et al. [46], where the α's are set to be constant in a level such that α_{L} = L^{2} = α_{j0} = α_{j1}. Another point to consider is that floating point precision can lead to redundant score values. However, at least in the data sets considered in this work, stopping at the level where the values cannot be partitioned further is sufficient. In fact, it was found that in the data sets considered in this work, the number of levels could be limited to <19 without loss of calibration accuracy or granularity. However, it remains to be seen how generalizable this threshold might be.
The lower and upper bounds of the distribution also need to be determined. Holmes et al. [46] suggested partitioning in terms of quantiles. However, a more straightforward approach was found to be sufficient: the partition is centered at the median of the training sample, and the upper and lower bounds of the partition space are then expanded by equal amounts until the space includes all the points.
Lastly, priors on H _{1} and H _{2} are determined by the relative sizes of the classes in the training set.
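Putting the pieces of this section together, the posterior can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes the closed-form Beta-binomial marginal likelihood of the Polya tree test in Holmes et al. [46], α = L^{2} at level L, bounds centered on the training median, and priors taken from the class sizes. The names `posterior_h1` and `path_counts`, the default depth of 8, and the decision to widen the bounds so that the new score always lies inside them are assumptions made for the example.

```python
import math
from statistics import median

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def path_counts(values, x_p, lower, upper, depth):
    """For each level along the path of x_p, count the values that share
    x_p's current interval and fall left/right of its midpoint."""
    rows, lo, hi, inside = [], lower, upper, list(values)
    for _ in range(depth):
        mid = (lo + hi) / 2.0
        left = [v for v in inside if v < mid]
        right = [v for v in inside if v >= mid]
        goes_left = x_p < mid
        rows.append((len(left), len(right), goes_left))
        lo, hi, inside = (lo, mid, left) if goes_left else (mid, hi, right)
    return rows

def posterior_h1(x_p, x1, x2, depth=8):
    """Posterior probability that x_p shares a parent with sample x1."""
    train = list(x1) + list(x2)
    center = median(train)
    # Assumption: bounds also cover x_p, with a small margin.
    half = max(abs(v - center) for v in train + [x_p]) * 1.0001 + 1e-12
    lower, upper = center - half, center + half
    rows1 = path_counts(x1, x_p, lower, upper, depth)
    rows2 = path_counts(x2, x_p, lower, upper, depth)
    log_odds = math.log(len(x1)) - math.log(len(x2))  # prior from class sizes
    for level, ((m0, m1, goes_left), (n0, n1, _)) in enumerate(zip(rows1, rows2), 1):
        a = float(level ** 2)                          # alpha_L = L^2
        l0, l1 = (1, 0) if goes_left else (0, 1)       # counts of x_p itself
        log_odds += (log_beta(a + l0 + m0, a + l1 + m1) + log_beta(a + n0, a + n1)
                     - log_beta(a + m0, a + m1) - log_beta(a + l0 + n0, a + l1 + n1))
    return 1.0 / (1.0 + math.exp(-log_odds))

class_1 = [-2.0, -1.5, -1.0, -0.5, 0.0]
class_2 = [1.0, 1.5, 2.0, 2.5, 3.0]
print(posterior_h1(-1.0, class_1, class_2), posterior_h1(2.0, class_1, class_2))
```

Only the partitions on the path of the new score are visited, because every other partition contributes the same factor under both hypotheses. Working with log-odds and `math.lgamma` avoids overflow in the gamma functions for larger training sets.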
Comparing the BBQ method and the proposed approach
This section describes how reliability diagrams are generated, using a variety of data sets and ML classifiers, to compare the state-of-the-art BBQ method with the proposed method. Reliability diagrams [40, 55, 56] are generally used to evaluate the accuracy and granularity of the conversion methods by comparing the observed (true) frequency of an event with the predicted probability of an event. The predicted probabilities are sorted into 10 discrete bins, and for each bin, the mean predicted value is plotted against the true fraction of positive cases. The better the calibration, the closer the points will fall to the diagonal line. The finer the granularity, the more points (occupied bins) will be on the diagram.
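The construction of the reliability diagram points can be sketched as follows. The function name `reliability_points` and the use of equal-width bins on [0, 1] are assumptions for the example; only occupied bins produce a point, so the number of returned points is the granularity measure described above.

```python
def reliability_points(probs, outcomes, n_bins=10):
    """Return (mean predicted probability, observed positive fraction)
    for every occupied equal-width bin on [0, 1]."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [prob sum, positives, count]
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)     # p == 1.0 goes in the last bin
        bins[b][0] += p
        bins[b][1] += y
        bins[b][2] += 1
    return [(s / n, pos / n) for s, pos, n in bins if n > 0]

predicted = [0.05, 0.15, 0.95, 0.85, 0.92]
observed = [0, 0, 1, 1, 0]                        # 1 = positive case
points = reliability_points(predicted, observed)
print(len(points))  # 4 occupied bins
```

A perfectly calibrated method would place every returned point on the diagonal, where the mean predicted probability equals the observed fraction.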
The following two ML methods are used: a standard SVM-based classification method with a well-behaved, well-understood score; and an ad hoc discriminant classification method constructed from a k-means algorithm.
The k-means discriminant is calculated by clustering a training set that contains two distinct classes of objects, and then determining which labels best represent each cluster. The centroid is determined for each cluster, and the label of a new (test) point is assigned by determining which centroid is closest. Assuming two classes, A and B, the k-means discriminant is then defined as the ratio of the distances of the new point to the two centroids. (Along the same lines, the tuning of the SVM parameters and feature selection methods are also kept to a minimum to ensure a wide range of predicted probabilities for the reliability diagrams.)
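A minimal sketch of such a discriminant is given below. It is illustrative only: the text does not specify the initialization, the distance metric, or which distance goes in the numerator, so the deterministic start in `two_means`, the Euclidean distance, and the A-over-B ratio are all assumptions. The sketch also assumes that the two clusters are non-empty and receive different majority labels.

```python
import math
from collections import Counter

def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_means(points, iters=50):
    """Lloyd's algorithm for k = 2, started from the first point and the
    point farthest from it (a deterministic start chosen for the example)."""
    cents = [list(points[0]), list(max(points, key=lambda p: dist(p, points[0])))]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            groups[0 if dist(p, cents[0]) <= dist(p, cents[1]) else 1].append(p)
        new = [[sum(c) / len(g) for c in zip(*g)] if g else old
               for g, old in zip(groups, cents)]
        if new == cents:   # assignments stopped changing
            break
        cents = new
    return cents

def label_centroids(points, labels, cents):
    """Give each centroid the majority training label of its cluster."""
    votes = [Counter(), Counter()]
    for p, lab in zip(points, labels):
        votes[0 if dist(p, cents[0]) <= dist(p, cents[1]) else 1][lab] += 1
    return {votes[i].most_common(1)[0][0]: cents[i] for i in range(2)}

def kmeans_discriminant(point, centroid_a, centroid_b):
    """Ratio of distances to the A and B centroids; below 1 means closer to A."""
    return dist(point, centroid_a) / dist(point, centroid_b)

train = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
classes = ["A", "A", "A", "B", "B", "B"]
by_label = label_centroids(train, classes, two_means(train))
score = kmeans_discriminant((1, 1), by_label["A"], by_label["B"])
```

The resulting score is an unbounded positive real number, which is the kind of non-probabilistic output the calibration is meant to handle.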
The unconventional definition of the k-means discriminant serves two purposes. First, the algorithm renders a classifier that has marginal performance, thereby allowing a better understanding of the proposed method's behavior when there is a large overlap. Second, the k-means classifier output distributions are highly non-Gaussian, allowing insight into the proposed method's generalizability.
The methods are demonstrated on three types of data sets: simulated classifier outputs, data sets from a popular ML data set repository, and a clinical data set. Each data set is divided into training and test subsets. The training sets are used to generate the distributions for the two classes, X_{1} and X_{2}. The test sets are then used to create the reliability diagram, where each point in the test set, X_{p}, is compared to X_{1} and X_{2} using both BBQ and the proposed method.
The simulated classifier outputs are generated from Gaussian distributions. The training set contains 50 positive cases randomly generated from a Gaussian distribution with zero mean and unit variance, and 50 negative cases are randomly generated from a second Gaussian distribution with a unit variance and certain fractional overlap with the first distribution (i.e., nonzero mean). With the BBQ and proposed method trained on these data, reliability diagrams are constructed on 100 test data with an equal number of positive and negative cases. The number of calibrated points in the reliability diagrams, range of predicted probabilities, and the goodness of fit of the calibrated points are evaluated. This training and testing is repeated 20 times for a given overlap in the Gaussian distributions and the results are averaged.
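The simulated training set can be sketched as follows. The text does not define "fractional overlap"; this example assumes it is the overlapping area of the two unit-variance Gaussian densities, which for a mean separation d equals 2Φ(−d/2). The function names, the seed, and the choice to shift the negative class toward positive values are likewise assumptions.

```python
import random
from statistics import NormalDist

def mean_shift_for_overlap(overlap):
    """Mean separation d of two unit-variance Gaussians whose densities
    share the given overlapping area: overlap = 2 * Phi(-d / 2)."""
    return -2.0 * NormalDist().inv_cdf(overlap / 2.0)

def simulate_scores(overlap, n_per_class=50, seed=0):
    """Draw positive scores from N(0, 1) and negative scores from N(d, 1)."""
    rng = random.Random(seed)
    shift = mean_shift_for_overlap(overlap)
    positives = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    negatives = [rng.gauss(shift, 1.0) for _ in range(n_per_class)]
    return positives, negatives

shift = mean_shift_for_overlap(0.3)
positives, negatives = simulate_scores(0.3)
# Cross-check the separation against the standard library's own overlap measure.
check = NormalDist(0.0, 1.0).overlap(NormalDist(shift, 1.0))
```

Repeating `simulate_scores` with different seeds gives the repeated training and test draws over which the results are averaged.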
The biomedical data sets, described in Table 1, were taken from the University of California, Irvine Machine Learning repository [57, 58]. Although the balances between positive and negative instances vary dramatically between these data sets, any overfitting resulting from these imbalances would be accounted for in the calibration. To see this, suppose an ML algorithm produces an overfitted model if the data set is imbalanced. This imbalance is roughly approximated in the 'training' folds of the LOO cross-validation used to produce the distributions of positive and negative instances for the calibration. Any biases resulting from the ML algorithm's tendencies to overfit are then accounted for in these distributions, since they are constructed from the test folds of the cross-validation.
The clinical data set, built to identify suicidal individuals using their language, contains the word frequencies of 161 suicidal and 153 control subjects from the Suicidal Adolescent Clinical Trial [59] and the Suicidal Thought Markers Study [60]. The data set contains 6226 unique words; a Kolmogorov-Smirnov test [61] was used to choose the top 124 most discriminating words for classification. The data with the reduced feature sets are L2 normalized on a per-subject basis to increase the discriminatory power of the SVM classifier and to therefore produce a wider range of ML scores.
The practical implementation of the proposed method is described in the previous section. The BBQ method is implemented through the corresponding R package [62], using the default parameters and the "BDeu2" core function, as it was found to give finer granularity of probabilities for the SVM than "BDeu". It was also found to give a far better calibration (although with fewer calibrated points) for the k-means algorithm on the Parkinson's data set. However, the effect of changing these parameters will be explored.
Results
For the simulated data sets, reliability diagrams are constructed for various overlaps in the simulated ML output distributions. For a given overlap, the χ^{2} p-values quantifying the goodness of fit to a slope of 1, the number of calibration points, and the range in the calibrated probabilities are averaged and plotted. (The χ^{2} is calculated by weighting the residuals by the inverse of the standard deviation of the calibrated probabilities.) Figure 2 compares these averages as a function of the overlap. As evidenced by the χ^{2} p-values, the calibration accuracies for the proposed method are comparable to, if not higher than, those of the BBQ method, especially for smaller overlaps. The exception to this lies in the region of largest overlap, where the BBQ method outperforms the proposed method; however, both methods produce fits with p-values greater than 0.2. Comparing the number of calibration points and calibrated probability ranges, it is clear the proposed method consistently outperforms the BBQ method.
But these results assume highly idealized (Gaussian) distributions for the ML outputs. Figures 3 and 4 then present the results from the biomedical data sets. They include the training set SVM and k-means ML scores used to generate the reliability diagrams, and the reliability diagrams themselves plotted with the diagonals indicating perfect calibration. For comparison, the training distributions are generated using both LOO and 10-fold cross-validation. It can be seen that changing the k-fold cross-validation used to build the training distributions simply leads to fewer calibration points for both BBQ and the proposed method.
Tables 2 and 3 show the χ^{2} p-values and number of calibrated points for the SVM and k-means based classifiers, respectively, for both BBQ and the proposed method. One can see that the calibrations are, on average, comparable for the two methods. This is especially true when the ML scores from each class are unimodal and cleanly separated from the other class. Pairwise t-tests between the χ^{2} p-values yield p-values of 0.61 and 0.58 for the SVM and k-means classifiers, respectively. However, the advantages of the proposed method become apparent for larger overlaps in the class distributions of ML scores. This is shown by comparing the accuracies, numbers of calibrated points, and ranges of calibration points for the SVM and k-means methods, which have less and more overlap in the ML scores, respectively. Performing a pairwise, one-sided t-test between the number of calibrated points for the two methods gives a p-value of 0.19 for the SVM classifier, where the overlaps are smaller, indicating that BBQ and the proposed method render similar numbers of calibrated points. However, performing a similar test with the k-means classifier, where the overlaps are large, gives a t-test p-value of 0.002, indicating that the proposed method renders a systematically larger number of calibrated points. Performing the same test on the ranges, the p-values are 0.06 and 0.01 for the SVM and k-means classifiers, respectively, indicating a systematically more dynamic range of calibrated probabilities. The results are more dramatic when the tests are performed on just those data sets with high overlap, highlighted in Tables 2 and 3. While the t-test p-value for the χ^{2} p-values indicates comparable calibration accuracies (0.67), the t-test p-values for the calibration points and ranges indicate substantial differences (0.0002 and 0.003, respectively).
It can then be concluded that the proposed method renders a systematically larger number and more dynamic range of calibrated probabilities on the biomedical and clinical data sets. Note that, for either method, calibration does not seem to be affected by either sample size or the balance of the data set.
Although Naeini et al. [41] suggested optimum parameters for the BBQ method, it is worth exploring whether the comparisons with the proposed method change if they are altered. The scoring method, binning (N0), and the threshold that determines the optimal binning (α) are then modified and the BBQ method is re-evaluated on one of the data sets (the clinical data set) to gauge the parameters' effect on the calibration. Table 4 shows the calibration points, range of calibration points, and reliability diagrams as a function of the changing BBQ parameters. It is clear from Table 4 that dramatically altering the BBQ parameters does not strongly affect the calibration for either the SVM or k-means classifiers.
Discussion
In this work, a novel method for calibrating ML scores to probabilities was introduced. Using a number of data sets of varying sizes and two different ML methods, it was demonstrated that this method allows a more granular and more dynamic range of calibrated probabilities as compared to a current state-of-the-art calibration technique (BBQ). This is not surprising given that, unlike the BBQ, our method is not limited to a finite set of binning schemes for the calibration, and it naturally folds in statistical uncertainties due to the limited size of the training sample. Also, the proposed method systematically pushes out the upper and lower boundaries of the calibrated probabilities, allowing for more extreme (dynamic) probabilities, which are crucial for assessing clinical risk. The advantages of the proposed method are particularly dramatic in the 8 cases boxed in Figs. 3 and 4, where the overlaps between the class distributions of ML scores become large. The results from the simulated data indicate that high accuracies in calibration are possible, especially when the overlaps in the ML score of the two classes are small.
Further, as evidenced by the results from the Lung Cancer, Parkinsons, Suicide, Arrhythmia, Breast Cancer and Contraception data sets, the imbalance of the train or test data sets does not have an effect on the accuracy of the calibration. Sample size also does not appear to strongly affect calibration.
It is also interesting that both the proposed method and the BBQ method were trained using ML output distributions generated from LOO cross-validation of the training set that was used to generate the ML model. The same training set was therefore used to train both the calibration method and the ML model, and both calibration techniques were able to calibrate the ML scores to a high overall accuracy. That is, the results suggest separate data sets might not be necessary to train the model and build the case and control distributions for the calibration. Decreasing the number of folds only decreases the granularity of both the BBQ and the proposed method, as demonstrated in Figs. 3 and 4.
In summary, the results indicate that the proposed method gives comparable or better accuracy (as indicated from the simulated ML outputs). Both the simulated and real data sets indicated a systematically finer granularity and greater range of calibrated probabilities using the proposed method, especially when there are large overlaps in the ML output distributions for the two classes. Tests on the clinical data set indicate changes in the BBQ parameters would not change these conclusions.
However, questions may remain as to why ML methods that return a non-probabilistic result should be considered when there are so many probabilistic ML methods in the literature. For instance, in Sowa et al. [63], logistic regression (LR), decision tree (DT), support-vector machine (SVM), and random forest (RF) models were trained to distinguish between individuals with non-alcoholic fatty liver disease (NAFLD) and alcoholic liver disease without cirrhosis (ALDNC), and between alcoholic liver disease with cirrhosis (ALDC) and alcoholic liver disease without cirrhosis (ALDNC). All of the ML models yielded comparable accuracies, with the RF carrying the advantage of a probabilistic interpretation. There would still be advantages to converting the ML scores to probabilities in this case. For instance, as shown in Malley et al. [64], the probabilities returned by these models, including the LR and RF ones, cannot necessarily be taken at face value. Also, our method acts to normalize the ML results from the four classifiers onto a single, intuitive scale. But, more broadly, there are instances where ML models with non-probabilistic outputs outperform methods that allow a probabilistic interpretation of the results. For instance, Statnikov et al. [65] compared RF and SVM models for microarray-based cancer classification, finding that SVM models consistently outperformed RF models.
Conclusions
A novel nonparametric Bayesian technique is proposed for calibrating the outputs of an ML-based algorithm to a probability. The method's generalizability was demonstrated by applying it to two disparate ML classifier discriminants: an SVM discriminant and an arbitrarily defined k-means discriminant. In applying this method to these classifiers over a diverse array of real and simulated data sets, it was shown to yield a broader, more dynamic range of calibrated probabilities with a finer granularity, especially when discrimination between the classes is poor. This provides more nuanced diagnostic and prognostic probabilistic assessments from ML-based clinical decision support systems, allowing clinicians and patients to make better decisions. Therefore, converting ML outputs to probabilities substantially improves clinical decision making.
Although the focus of this work has been calibrating ML scores, there is no reason why the output necessarily needs to be derived from a machine. The method can easily be extended to calibrate any clinical score (e.g., psychiatric rating scales, illness severity scores, etc.), where the prior on α_{L} goes as 2^{−L} if the scores are discrete [46].
In future work, methods of generalizing this formalism to multiclass problems will be explored. This is not a trivial undertaking, as many scores may need to be combined to calculate a posterior probability. Other future research directions will include understanding how the Bayesian formalism might be leveraged to include hypotheses which assume that the new (test) point X_{p} is not derived from either of the parent class distributions.
Abbreviations
BBQ: Bayesian binning in quantiles
IR: Isotonic regression
LOO: Leave one out
LR: Logistic regression
ML: Machine learning
STM: Suicide thought markers
References
Hunt DL, Haynes RB, Hanna SE, Smith K. Effects of computer-based clinical decision support systems on physician performance and patient outcomes: a systematic review. JAMA. 1998;280(15):1339–46.
Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform. 2009;42(5):760–72.
Garg AX, Adhikari NK, McDonald H, Rosas-Arellano MP, Devereaux P, Beyene J, et al. Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: a systematic review. JAMA. 2005;293(10):1223–38.
Jaspers MW, Smeulers M, Vermeulen H, Peute LW. Effects of clinical decision-support systems on practitioner performance and patient outcomes: a synthesis of high-quality systematic review findings. J Am Med Inform Assoc. 2011;18(3):327–34.
Dexheimer JW, Johnson LH, Solti I, Aronsky D, Pestian JP. Pediatric biomedical informatics. In: Informatics and decision support: Springer; 2012. p. 193–209.
Kidd M, Purves I. Evidence-based practice in primary care; 2001.
Connolly B, Faist R, West C, Holland KD, Matykiewicz P, Glauser TA, et al. A statistical approach for visualizing the quality of multi-hospital data. Visible Lang. 2014;48(3):68.
Pestian J, Matykiewicz P, Holland-Bouley K, Standridge S, Spencer M, Glauser T. Selecting anti-epileptic drugs: a pediatric epileptologist’s view, a computer’s view. Acta Neurol Scand. 2013;127(3):208–15.
Glauser TA, Wenstrup RJ, Vinks AA, Pestian J. Optimization and individualization of medication selection and dosing: Google Patents; 2013. US Patent App. 14/053,220
Kaushal R, Shojania KG, Bates DW. Effects of computerized physician order entry and clinical decision support systems on medication safety: a systematic review. Arch Intern Med. 2003;163(12):1409–16.
Walton R, Dovey S, Harvey E, Freemantle N. Computer support for determining drug dose: systematic review and meta-analysis. BMJ. 1999;318(7189):984–90.
Grol R, Grimshaw J. From best evidence to best practice: effective implementation of change in patients’ care. Lancet. 2003;362(9391):1225–30.
Matykiewicz P, Cohen KB, Holland KD, Glauser TA, Standridge SM, Verspoor KM, et al. Earlier identification of epilepsy surgery candidates using natural language processing. ACL. 2013:1.
Standridge S, Faist R, Pestian J, Glauser T, Ittenbach R. The reliability of an epilepsy treatment clinical decision support system. J Med Syst. 2014;38(10):1–6.
Cohen KB, Glass B, Greiner HM, Holland-Bouley K, Standridge S, Arya R, et al. Methodological issues in predicting pediatric epilepsy surgery candidates through natural language processing and machine learning. Biomed Inform Insights. 2016;8:11.
Pestian JP, Glauser TA, Matykiewicz P, Holland KD, Standridge SM, Greiner HM, et al. Identification of surgery candidates using natural language processing: Google Patents; 2014. US Patent App. 14/908,084
Simon P. Too big to ignore: the business case for big data, vol. 72: Wiley; 2013.
Tan AC, Gilbert D. An empirical comparison of supervised machine learning techniques in bioinformatics. In: APBC, vol. 19: Australian Computer Society, Inc., 2003. p. 219–22.
Zhou X, Xu J, Zhao Y. Machine learning methods for anticipating the psychological distress in patients with Alzheimer’s disease. Australas Phys Eng Sci Med. 2006;29(4):303.
Silva S, Peran P, Kerhuel L, Malagurski B, Chauveau N, Bataille B, et al. Brain gray matter MRI morphometry for neuroprognostication after cardiac arrest. Crit Care Med. 2017;
Plumb A, Grieve F, Khan S. Survey of hospital clinicians’ preferences regarding the format of radiology reports. Clin Radiol. 2009;64(4):386–94.
Brundage MD, Smith KC, Little EA, Bantug ET, Snyder CF, et al. Communicating patient-reported outcome scores using graphic formats: results from a mixed-methods evaluation. Qual Life Res. 2015;24(10):2457–72.
Verheul R. Clinical utility of dimensional models for personality pathology. J Personal Disord. 2005;19(3):283.
Eskander MG, Leung A, Lee D. Style and content of CT and MR imaging lumbar spine reports: radiologist and clinician preferences. Am J Neuroradiol. 2010;31(10):1842–7.
Heffner DK, Adair CF. Clarity on the diagnosis line (the devil is in the details). Ann Diagn Pathol. 1999;3(3):187–91.
Bipartisan Policy Center. Bipartisan Policy Center task force on delivery system reform and health IT. Transforming health care: the role of health IT; 2012. http://bipartisanpolicy.org/sites/default/files/Transforming%20Health%20Care.pdf. Accessed 5 Dec 2016.
Swift L, Miles S, Price GM, Shepstone L, Leinster SJ. Do doctors need statistics? Doctors’ use of and attitudes to probability and statistics. Stat Med. 2009;28(15):1969–81.
Eddy DM. The challenge. JAMA. 1990;263(2):287–90.
Shapiro AR. The evaluation of clinical predictions: a method and initial application. In: Computer-assisted medical decision making: Springer; 1985. p. 189–201.
Hopkins WG. Probabilities of clinical or practical significance. Sportscience. 2002;6(201):16.
Grimes DA, Schulz KF. Refining clinical diagnosis with likelihood ratios. Lancet. 2005;365(9469):1500–5.
Wells PS, Anderson DR, Bormanis J, Guy F, Mitchell M, Gray L, et al. Value of assessment of pretest probability of deep-vein thrombosis in clinical management. Lancet. 1997;350(9094):1795–8.
Kanis JA, Hans D, Cooper C, Baim S, Bilezikian JP, Binkley N, et al. Interpretation and use of FRAX in clinical practice. Osteoporos Int. 2011;22(9):2395–411.
Mazur DJ, Hickam DH. Patients’ interpretations of probability terms. J Gen Intern Med. 1991;6(3):237–40.
Edwards A, Elwyn G. Shared decision-making in health care: achieving evidence-based patient choice: Oxford University Press; 2009.
Trevena LJ, Barratt A, Butow P, Caldwell P. A systematic review on communicating with patients about evidence. J Eval Clin Pract. 2006;12(1):13–23.
Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers. 1999;10(3):61–74.
Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. In: ACM SIGKDD international conference on knowledge discovery and data mining: ACM; 2002. p. 694–9.
Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: ICML, vol. 1; 2001. p. 609–16.
Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on machine learning: ACM; 2005. p. 625–32.
Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of AAAI, vol. 2015: NIH Public Access; 2015. p. 2901.
Borgwardt KM, Ghahramani Z. Bayesian two-sample tests. arXiv preprint arXiv:0906.4032. 2009.
Dunson DB, Peddada SD. Bayesian nonparametric inference on stochastic ordering. Biometrika. 2008:859–74.
Pennell ML, Dunson DB. Nonparametric Bayes testing of changes in a response distribution with an ordinal predictor. Biometrics. 2008;64(2):413–23.
Bhattacharya A, Dunson D. Nonparametric Bayes classification and hypothesis testing on manifolds. J Multivar Anal. 2012;111:1–19.
Holmes CC, Caron F, Griffin JE, Stephens DA, et al. Two-sample Bayesian nonparametric hypothesis testing. Bayesian Anal. 2015;10(2):297–320.
Hochhauser M. Risk overload and risk misdirection in the consent process; https://www.socra.org/publications/pastsocrasourcearticles/riskoverloadandriskmisdirectionintheconsentprocess/. Accessed 5 Dec 2016.
The University of Tennessee Chattanooga. Informed consent requirements; https://www.utc.edu/researchintegrity/institutionalreviewboard/informedconsent/. Accessed 5 Dec 2016.
Royal College of Obstetricians and Gynaecologists. Clinical governance advice no. 7; https://www.rcog.org.uk/globalassets/documents/guidelines/clinicalgovernanceadvice/cga715072010.pdf. Accessed 5 Dec 2016.
Government of Western Australia Department of Health. Integrated corporate and clinical risk analysis tables and evaluation criteria; http://ww2.health.wa.gov.au/~/media/Files/Corporate/general%20documents/Quality/PDF/WA risk analysis tables.ashx. Accessed 5 Dec 2016.
Conroy R, Pyörälä K, Fitzgerald AP, Sans S, Menotti A, De Backer G, et al. Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J. 2003;24(11):987–1003.
Sarnak MJ, Levey AS, Schoolwerth AC, Coresh J, Culleton B, Hamm LL, et al. Kidney disease as a risk factor for development of cardiovascular disease: a statement from the American Heart Association councils on kidney in cardiovascular disease, high blood pressure research, clinical cardiology, and epidemiology and prevention. Circulation. 2003;108(17):2154–69.
Kanis JA. Diagnosis of osteoporosis and assessment of fracture risk. Lancet. 2002;359(9321):1929–36.
Ferguson TS. Prior distributions on spaces of probability measures. Ann Stat. 1974:615–29.
Hartmann HC, Pagano TC, Sorooshian S, Bales R. Confidence builders: evaluating seasonal climate forecasts from user perspectives. Bull Am Meteorol Soc. 2002;83(5):683.
DeGroot MH, Fienberg SE. The comparison and evaluation of forecasters. The Statistician. 1983:12–22.
Asuncion A, Newman D. UCI machine learning repository; 2007. http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed 5 Dec 2016.
Lichman M. UCI machine learning repository; 2013. http://archive.ics.uci.edu/ml.
Pestian J, Matykiewicz P, Cohen K, Grupp-Phelan J, Richey L, Meyers G, et al. Suicidal thought markers: a controlled trial examining the language of suicidal adolescents. In: American Association of Suicidology annual conference. Austin; 2013.
Pestian JP, Sorter M, Connolly B, Bretonnel Cohen K, McCullumsmith C, Gee JT, et al. A machine learning approach to identifying the thought markers of suicidal subjects: a prospective multicenter trial. Suicide Life Threat Behav. 2016;
Wilcox R. Kolmogorov–Smirnov test. Encyclopedia of biostatistics. 2005.
Ghalanos A. bbq: Bayesian binning into quantiles. R package version 0.1.0.
Sowa JP, Atmaca Ö, Kahraman A, Schlattjan M, Lindner M, Sydor S, et al. Noninvasive separation of alcoholic and nonalcoholic liver disease with predictive modeling. PLoS One. 2014;9(7):e101444.
Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A. Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med. 2012;51(1):74.
Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9(1):319.
Hong ZQ, Yang JY. Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recogn. 1991;24(4):317–24.
Cios K. SPECT heart data set from the UCI machine learning repository. http://mlr.cs.umass.edu/ml/datasets/SPECT+Heart. Krys.Cios@cudenver.edu. Accessed 5 Dec 2016.
Kurgan L. SPECT heart data set from the UCI machine learning repository. http://mlr.cs.umass.edu/ml/datasets/SPECT+Heart. Accessed 5 Dec 2016.
Little MA, McSharry PE, Roberts SJ, Costello DA, Moroz IM. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomed Eng Online. 2007;6(1):1.
Guyon I, Gunn S, Ben-Hur A, Dror G. Result analysis of the NIPS 2003 feature selection challenge. In: Advances in neural information processing systems; 2004. p. 545–52.
Guvenir HA, Acar B, Demiroz G, Cekin A. A supervised machine learning algorithm for arrhythmia analysis. In: Computers in cardiology 1997: IEEE; 1997. p. 433–6.
Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. In: IS&T/SPIE’s symposium on electronic imaging: science and technology: International Society for Optics and Photonics; 1993. p. 861–70.
Mangasarian OL, Street WN, Wolberg WH. Breast cancer diagnosis and prognosis via linear programming. Oper Res. 1995;43(4):570–7.
Lim TS, Loh WY, Shih YS. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Mach Learn. 2000;40(3):203–28.
Acknowledgements
Leslie Korbee provided copy editing and advice on presentation of results.
Funding
This work was supported by the Cincinnati Children’s Hospital Medical Center Department of Neurosurgery, and the Division of Biomedical Informatics, Department of Pediatrics, University of Cincinnati College of Medicine.
Availability of data and materials
The biomedical datasets generated and/or analysed during the current study are available in the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/ [58]. The datasets generated and/or analysed during the suicide studies are not publicly available due to privacy concerns.
Author information
Contributions
BC conceptualized the project and developed the novel methodology and analysis. JP and KBC helped conceptualize and revise the manuscript critically for important intellectual content. DS and UB helped revise the manuscript critically for important intellectual content. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
The clinical (suicide) data in this study was collected, analyzed and published under protocols 2008–1421 and 2013–3770, which were reviewed and approved by the Cincinnati Children’s Institutional Review Board.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Connolly, B., Cohen, K.B., Santel, D. et al. A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support. BMC Bioinformatics 18, 361 (2017). https://doi.org/10.1186/s12859-017-1736-3
DOI: https://doi.org/10.1186/s12859-017-1736-3
Keywords
 Statistics
 Nonparametric
 Bayesian
 Calibration
 Machine learning