BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

BMC Bioinformatics

Table 6 The correlations between the training set statistics and BICEPP performance

Training set statistics	Method
	*cdf*	*ctf*	*ctf-icdf*	Stemming	Synonyms
(a) Number of positive examples	-.49	-.50	-.55	-.51	-.53
(b) Sum of article counts	-.39	-.39	-.45	-.40	-.43
(c) Maximum article count	-.20	-.19	-.23	-.20	-.26
(d) Mean article count	.09	.10	.06	.10	.00
(e) Median article count	.18	.17	.17	.16	.20
(f) Minimum article count	.32	.33	.34	.34	.33
(g) Variance of article counts	-.01	-.00	-.03	-.01	-.09
(h) Skewness of article counts	-.33	-.35	-.36	-.35	-.40

The numbers in the table are Spearman's rank correlation coefficients σ between each training set statistic and the best AUC obtained from four machine learning algorithms (NB, IBk, and SVM with polynomial and RBF kernels), computed over the 272 of 484 drug characteristics with ≥10 positive examples. For each drug, the article count is the number of abstracts retrieved by searching the MEDLINE database with the drug name. Entries and category names in boldface indicate p < 0.0001, as determined by the test of the rank correlation coefficient using the Fieller-Hartley-Pearson method [37]. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens calculated by retrieving abstracts with both generic and trade names for a given drug.
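
The calculation described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the toy data are hypothetical, the variance and skewness use population (biased) estimators because the caption does not say which estimator was used, and the significance test assumes the Fieller-Hartley-Pearson approximation in which the Fisher z-transform of the rank correlation has variance 1.06/(n - 3).

```python
import math
from statistics import mean, median, pvariance


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def fhp_p_value(r, n):
    """Two-sided p-value, assuming Fisher z with variance 1.06 / (n - 3)."""
    z = math.atanh(r) * math.sqrt((n - 3) / 1.06)
    return math.erfc(abs(z) / math.sqrt(2))


def training_set_stats(article_counts):
    """Statistics (a)-(h) for the positive examples of one characteristic."""
    m = mean(article_counts)
    m2 = pvariance(article_counts)  # assumption: population variance
    m3 = mean((c - m) ** 3 for c in article_counts)
    return {
        "n_positive": len(article_counts),
        "sum": sum(article_counts),
        "max": max(article_counts),
        "mean": m,
        "median": median(article_counts),
        "min": min(article_counts),
        "variance": m2,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,  # assumption: biased estimator
    }


# Hypothetical toy data: one entry per drug characteristic, holding the
# article counts of its positive-example drugs and the AUCs of the four
# algorithms (NB, IBk, SVM-poly, SVM-RBF). Values are invented.
toy = [
    ([5, 12, 40], [0.90, 0.88, 0.91, 0.86]),
    ([3, 30, 200, 1500], [0.84, 0.80, 0.83, 0.79]),
    ([10, 10, 25, 60, 900], [0.78, 0.81, 0.77, 0.80]),
    ([2, 8, 15, 100, 400, 5000], [0.70, 0.74, 0.72, 0.69]),
    ([1, 4, 9, 50, 75, 300, 8000], [0.66, 0.65, 0.68, 0.61]),
]

stats = [training_set_stats(counts) for counts, _ in toy]
best_auc = [max(aucs) for _, aucs in toy]  # best AUC over the four algorithms

# One coefficient per training set statistic, as in one column of the table.
correlations = {
    name: spearman([s[name] for s in stats], best_auc) for name in stats[0]
}
for name, r in correlations.items():
    print(f"{name:>10}: {r:+.2f}")
```

In the table itself, each column would repeat this procedure with the AUCs obtained under one feature method (cdf, ctf, ctf-icdf, Stemming, or Synonyms), with n = 272 characteristics passed to `fhp_p_value`.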

ISSN: 1471-2105