A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support

Connolly, Brian; Cohen, K. Bretonnel; Santel, Daniel; Bayram, Ulya; Pestian, John

doi:10.1186/s12859-017-1736-3

Table 1 Data sets obtained from the University of California, Irvine Machine Learning Repository, with a brief description of each and the number of cases and controls in the training and testing sets used to demonstrate the proposed method

Data set	Description	Number of training cases/controls	Number of test cases/controls	Number of features	Citations
Lung Cancer	Clinical data, X-ray data, etc. used to predict three pathological types of lung cancer. The instances are divided into three classes of 9, 10, and 13 observations. For the purposes of this work, the first two classes are aggregated into a single class.	8/8	11/5	54 integer clinical features	[66]
SPECT	Instances of normal and abnormal cardiac diagnoses.	40/40	172/15	22 binary features indicating partial diagnoses	[67, 68]
Parkinsons	Biomedical voice measurements from 31 people, including 23 with Parkinson’s disease.	72/25	75/23	22 real features	[69]
Arcene	Mass-spectrometric data that can be used to distinguish patients with cancer from healthy subjects.	44/56	44/56	10,000 integer features. A Kolmogorov-Smirnov test [61] was used to select the 268 most discriminating features for classification.	[70]
Arrhythmia	Normal and “abnormal” instances of demographic and electrocardiogram features.	127/99	118/108	278 categorical, integer, and real demographic and electrocardiogram features. A Kolmogorov-Smirnov test [61] was used to select the 32 most discriminating features for classification.	[71]
Breast Cancer	This data set contains features computed from digitized images of fine needle aspirates (FNA) of breast masses, which describe characteristics of the cell nuclei present in the images. Instances are labeled benign or malignant and are described by real-valued features.	130/219	111/239	8	[72, 73]
Contraception	This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey, which sampled married women who were either not pregnant or did not know whether they were pregnant at the time of interview. The subset contains information on 1473 women, subdivided by contraceptive use: no use (629), long-term methods (333), or short-term methods (511). The binary classifier constructed in this work predicts whether or not a woman uses contraception based on her categorical and integer-valued demographic and socio-economic characteristics.	423/313	421/316	8	[74]
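
The Arcene and Arrhythmia rows state that a Kolmogorov-Smirnov test was used to select the most discriminating features. The table does not say how features were ranked, so the following is only a minimal sketch under stated assumptions: each feature is scored by the two-sample Kolmogorov-Smirnov statistic between cases and controls, and the k highest-scoring features are kept. The function names `ks_statistic` and `top_k_features`, the 0/1 label coding, and the toy data are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    # The gap can only change at observed values, so evaluate it there.
    points = sorted(set(a_sorted) | set(b_sorted))
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in points
    )


def top_k_features(rows, labels, k):
    """Return indices of the k columns that best separate cases (1) from controls (0).

    Assumption: "most discriminating" means the largest KS statistic.
    """
    scores = []
    for j in range(len(rows[0])):
        cases = [r[j] for r, y in zip(rows, labels) if y == 1]
        controls = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((ks_statistic(cases, controls), j))
    # Largest statistic first; ties are broken by the lower column index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data, made up for illustration: 6 samples with 3 features each.
rows = [
    [1, 5, 0],
    [2, 5, 1],
    [3, 6, 0],
    [7, 5, 1],
    [8, 6, 0],
    [9, 6, 1],
]
labels = [0, 0, 0, 1, 1, 1]
selected = top_k_features(rows, labels, k=1)
print(selected)
```

In this toy example only the first column fully separates the two groups, so it is the one selected. To avoid leaking test information into the classifier, a selection step like this would normally be run on the training set only, although the table does not state which samples were used.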

ISSN: 1471-2105

BMC Bioinformatics
