Table 1 Description of the data sets obtained from the University of California, Irvine Machine Learning repository, including a brief description and the number of cases and controls in the training and testing sets used to demonstrate the proposed method

From: A nonparametric Bayesian method of translating machine learning scores to probabilities in clinical decision support

| Data set | Description | Number of training cases/controls | Number of test cases/controls | Number of features | Citations |
| --- | --- | --- | --- | --- | --- |
| Lung Cancer | Clinical data, X-ray data, etc. used to predict 3 pathological types of lung cancer. The instances are divided into three classes of 9, 10, and 13 observations. For the purposes here, the first two classes are aggregated into a single class. | 8/8 | 11/5 | 54 integer clinical features | [66] |
| SPECT | Instances of normal and abnormal cardiac diagnoses. | 40/40 | 172/15 | 22 binary features indicating partial diagnoses | [67, 68] |
| Parkinsons | Biomedical voice measurements from 31 people, including 23 with Parkinson’s disease. | 72/25 | 75/23 | 22 real features | [69] |
| Arcene | Mass-spectrometric data that can be used to distinguish patients with cancer from healthy subjects. | 44/56 | 44/56 | The data set contains 10,000 integer features; a Kolmogorov-Smirnov test [61] was used to choose the top 268 most discriminating features for classification. | [70] |
| Arrhythmia | Normal and “abnormal” instances of demographic and electrocardiogram features. | 127/99 | 118/108 | 278 categorical, integer and real demographic and electrocardiogram features. A Kolmogorov-Smirnov test [61] was used to select the 32 most discriminating features for classification. | [71] |
| Breast Cancer | This data set contains features from digitized images of fine needle aspirates (FNA) of breast masses, which describe characteristics of the cell nuclei present in the images. The data set contains benign and malignant instances of real-valued features. | 130/219 | 111/239 | 8 | [72, 73] |
| Contraception | This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey, which samples married women who were either not pregnant or did not know if they were pregnant at the time of interview. The subset contains information for 1473 women, who are subdivided by contraceptive use: no use (629), long-term methods (333), or short-term methods (511). The binary classifier constructed in this work aims to predict whether or not a woman uses contraception based on her categorical and integer-valued demographic and socio-economic characteristics. | 423/313 | 421/316 | 8 | [74] |
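
The Arcene and Arrhythmia rows mention using a Kolmogorov-Smirnov test to keep only the most discriminating features. The table does not say how the features were ranked, so the following is only a minimal sketch under one plausible assumption: each feature is scored by the two-sample Kolmogorov-Smirnov statistic between cases and controls, and the k highest-scoring features are kept. The function names `ks_statistic` and `top_k_features` and the toy data are hypothetical and are not taken from the paper. A library routine such as `scipy.stats.ks_2samp` computes the same statistic; pure Python is used here so the sketch has no dependencies.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Empirical CDFs only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def top_k_features(cases, controls, k):
    """Return indices of the k features with the largest case/control KS statistic.

    Assumption: ranking by the raw statistic; the paper may have ranked differently.
    """
    n_features = len(cases[0])
    scores = []
    for j in range(n_features):
        case_vals = [row[j] for row in cases]
        control_vals = [row[j] for row in controls]
        scores.append((ks_statistic(case_vals, control_vals), j))
    # Sort by descending statistic; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data, made up for illustration: feature 0 separates the groups, feature 1 does not.
cases = [[5, 1], [6, 2], [7, 3]]
controls = [[1, 1], [2, 2], [3, 3]]
selected = top_k_features(cases, controls, k=1)
```

With rows as observations and columns as features, calling `top_k_features(cases, controls, k=268)` would mirror the Arcene selection size and `k=32` the Arrhythmia one, again assuming ranking by the statistic itself.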