BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

BMC Bioinformatics

Table 5 Comparative evaluation of cdf-based predictions with other commonly used IR methods

Category	Best AUC	Method
		*cdf*	*ctf*	*ctf-icdf*	Stemming	Synonyms
AMH major classes	> 0.80	20 (100)	20 (100)	20 (100)	19 (95)	20 (100)
	> 0.90	19 (95)	17 (85)	18 (90)	17 (85)	18 (90)
	> 0.95	12 (60)	12 (60)	15 (75)	12 (60)	11 (55)
AMH minor classes	> 0.80	135 (69)	106 (54)*	157 (80)	152 (77)	156 (79)
	> 0.90	123 (62)	100 (51)	150 (76)*	145 (74)	151 (77)*
	> 0.95	114 (58)	92 (47)	144 (73)*	142 (72)*	143 (73)*
AMH adverse events	> 0.80	159 (67)	148 (62)	173 (73)	153 (64)	155 (65)
	> 0.90	86 (36)	88 (37)	100 (42)	84 (35)	84 (35)
	> 0.95	41 (17)	42 (18)	55 (23)	44 (18)	44 (18)
PKIS perpetrator	> 0.80	7 (47)	6 (40)	7 (47)	12 (80)	10 (67)
	> 0.90	5 (33)	5 (33)	4 (27)	3 (20)	4 (27)
	> 0.95	2 (13)	2 (13)	2 (13)	3 (20)	3 (20)
Narrow therapeutic index drugs	> 0.80	9 (64)	9 (64)	11 (79)	10 (71)	10 (71)
	> 0.90	8 (57)	7 (50)	10 (71)	9 (64)	10 (71)
	> 0.95	5 (36)	6 (43)	9 (64)	9 (64)	10 (71)
Overall	> 0.80	330 (68)	289 (60)*	368 (76)*	346 (71)	351 (73)
	> 0.90	241 (50)	217 (45)	282 (58)*	258 (53)	267 (55)
	> 0.95	174 (36)	154 (32)	225 (46)*	210 (43)	211 (44)

Values are the number (percentage) of characteristics that achieved an AUC above the given threshold in stratified cross-validation evaluations. For each method, the result from the best of four algorithms was compared. The AUC thresholds can be interpreted as good (> 0.80), very good (> 0.90), and excellent (> 0.95), respectively. Entries marked with an asterisk (*) performed significantly better or worse than cdf in predicting drug characteristics. Fisher's exact tests were applied to 2 × 2 tables with α = 0.05, adjusted for a family of four comparisons using the Bonferroni method. Numbers in boldface indicate the best-performing method(s) for each characteristic category above the AUC = 0.80 threshold. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens calculated by retrieving abstracts with both generic and trade names for a given drug.
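The significance test described in the footnote can be illustrated with a minimal sketch. It is not the authors' software. The function name `fisher_exact_two_sided` is hypothetical, a two-sided test is assumed (the footnote says "better or worse" but does not state the sidedness), and the category total of 20 for AMH major classes is read from the "20 (100)" entries in the table. The example compares cdf (12 of 20) with ctf-icdf (15 of 20) at the AUC > 0.95 threshold.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]].

    Hypothetical helper for illustration; sidedness is an assumption.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def weight(k):
        # Unnormalised hypergeometric weight of the table whose
        # top-left cell equals k, with all margins held fixed.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the weights of every table that is no more probable than the
    # observed one. Integer arithmetic keeps the tie comparison exact.
    extreme = sum(
        weight(k) for k in range(lowest, highest + 1) if weight(k) <= observed
    )
    return extreme / comb(total, col1)


ALPHA = 0.05
N_COMPARISONS = 4  # four methods compared against cdf
bonferroni_alpha = ALPHA / N_COMPARISONS

# AMH major classes, AUC > 0.95: cdf 12 of 20 versus ctf-icdf 15 of 20.
# The total of 20 is inferred from the "20 (100)" entries (assumption).
n_characteristics = 20
cdf_hits, ctf_icdf_hits = 12, 15
p_value = fisher_exact_two_sided(
    cdf_hits, n_characteristics - cdf_hits,
    ctf_icdf_hits, n_characteristics - ctf_icdf_hits,
)
significant = p_value < bonferroni_alpha
print(f"p = {p_value:.4f}, adjusted alpha = {bonferroni_alpha}, "
      f"significant = {significant}")
```

Under these assumptions the comparison is not significant at the Bonferroni-adjusted level, which agrees with the absence of an asterisk on the ctf-icdf entry in that row.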

ISSN: 1471-2105